Preface


This book is intended to be a second course in probability for un- 
dergraduate and graduate students in statistics, mathematics, engi- 
neering, finance, and actuarial science. It is a guided tour aimed 
at instructors who want to give their students a familiarity with 
some advanced topics in probability, without having to wade through 
the exhaustive coverage contained in the classic advanced probabil- 
ity theory books (books by Billingsley, Chung, Durrett, Breiman, 
etc.). The topics covered here include measure theory, limit the- 
orems, bounding probabilities and expectations, coupling, Stein’s 
method, martingales, Markov chains, renewal theory, and Brownian 
motion. 

One noteworthy feature is that this text covers these advanced 
topics rigorously but without the need of much background in real 
analysis; other than calculus and material from a first undergraduate 
course in probability (at the level of A First Course in Probability, 
by Sheldon Ross), any other concepts required, such as the definition 
of convergence, the Lebesgue integral, and of countable and uncount- 
able sets, are introduced as needed. 

The treatment is highly selective, and one focus is on giving al- 
ternative or non-standard approaches for familiar topics to improve 
intuition. For example we introduce measure theory with an exam- 
ple of a non-measurable set, prove the law of large numbers using 
the ergodic theorem in the very first chapter, and later give two al- 
ternative (but beautiful) proofs of the central limit theorem using 
Stein’s method and Brownian motion embeddings. The coverage of 
martingales, probability bounds, Markov chains, and renewal theory 
focuses on applications in applied probability, where a number of 
recently developed results from the literature are given. 

The book can be used in a flexible fashion: starting with chapter 
1, the remaining chapters can be covered in almost any order, with 
a few caveats. We hope you enjoy this book. 


About notation 


Here we assume the reader is familiar with the mathematical notation
used in an elementary probability course. For example we write
$X \sim U(a,b)$ or $X =_d U(a,b)$ to mean that $X$ is a random variable
having a uniform distribution between the numbers $a$ and $b$. We use
common abbreviations like $N(\mu,\sigma^2)$ and Poisson($\lambda$) to
respectively mean a normal distribution with mean $\mu$ and variance
$\sigma^2$, and a Poisson distribution with parameter $\lambda$. We also
write $I_A$ or $I\{A\}$ to denote a random variable which equals 1 if $A$
is true and equals 0 otherwise, and use the abbreviation “iid” for
random variables to mean independent and identically distributed
random variables.
Chapter 1


Measure Theory and Laws 
of Large Numbers 


1.1 Introduction 


If you’re reading this you’ve probably already seen many different
types of random variables, and have applied the usual theorems and
laws of probability to them. We will, however, show you there are
some seemingly innocent random variables for which none of the laws
of probability apply. Measure theory, as it applies to probability, is
a theory which carefully describes the types of random variables the
laws of probability apply to. This puts the whole field of probability
and statistics on a mathematically rigorous foundation.

You are probably familiar with some proof of the famous strong
law of large numbers, which asserts that the long-run average of iid
random variables converges to the expected value. One goal of this
chapter is to show you a beautiful and more general alternative proof
of this result using the powerful ergodic theorem. In order to do this,
we will first take you on a brief tour of measure theory and introduce
you to the dominated convergence theorem, one of measure theory’s
most famous results and the key ingredient we need.

In section 2 we construct an event, called a non-measurable event,
to which the laws of probability don’t apply. In section 3 we introduce
the notions of countably and uncountably infinite sets, and show
you how the elements of some infinite sets cannot be listed in a
sequence. In section 4 we define a probability space, and the laws
of probability which apply to it. In section 5 we introduce the
concept of a measurable random variable, and in section 6 we define
the expected value in terms of the Lebesgue integral. In section 7 we
illustrate and prove the dominated convergence theorem, in section 8
we discuss convergence in probability and distribution, and in section
9 we prove 0-1 laws, the ergodic theorem, and use these to obtain
the strong law of large numbers.


1.2 A Non-Measurable Event 


Consider a circle having radius equal to one. We say that two points 
on the edge of the circle are in the same family if you can go from one 
point to the other point by taking steps of length one unit around 
the edge of the circle. By this we mean each step you take moves
you through an angle of exactly one radian around the circle, and you
are allowed to keep looping around the circle in either direction.


Suppose each family elects one of its members to be the head of 
the family. Here is the question: what is the probability a point X 
selected uniformly at random along the edge of the circle is the head 
of its family? It turns out this question has no answer. 

The first thing to notice is that each family has an infinite number
of family members. Since the circumference of the circle is $2\pi$, you
can never get back to your starting point by looping around the circle
with steps of length one. If it were possible to start at the top of the
circle and get back to the top going $a$ steps clockwise and looping
around $b$ times, then you would have $a = 2\pi b$ for some integers
$a, b$, and hence $\pi = a/(2b)$. This is impossible because it’s
well-known that $\pi$ is an irrational number and can’t be written as a
ratio of integers.

It may seem to you like the probability should either be zero or 
one, but we will show you why neither answer could be correct. It 
doesn’t even depend on how the family heads are elected. Define 
the events $A = \{X$ is the head of its family$\}$, $A_i = \{X$ is $i$
steps clockwise from the head of its family$\}$, and $B_i = \{X$ is $i$
steps counter-clockwise from the head of its family$\}$.

Since $X$ was uniformly chosen, we must have $P(A) = P(A_i) = P(B_i)$.
But since every family has a head, the sum of these probabilities
should equal 1, or in other words

$$1 = P(A) + \sum_{i=1}^{\infty} \bigl(P(A_i) + P(B_i)\bigr).$$

Thus if $x = P(A)$ we get $1 = x + \sum_{i=1}^{\infty} 2x$, which has no
solution where $0 \le x \le 1$. This means it’s impossible to compute
$P(A)$, and the answer is neither 0 nor 1, nor any other possible number. The
event A is called a non-measurable event, because you can’t measure 
its probability in a consistent way. 
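To spell out why no value of $P(A)$ can work, it may help to split the equation above into its two possible cases. This is only a restatement of the argument in the text, with $x$ standing for the common value $P(A) = P(A_i) = P(B_i)$:

```latex
\[
x + \sum_{i=1}^{\infty} 2x \;=\;
\begin{cases}
0, & \text{if } x = 0,\\[2pt]
\infty, & \text{if } 0 < x \le 1,
\end{cases}
\qquad\text{so the sum can never equal } 1 .
\]
```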

What’s going on here? It turns out that allowing only one head 
per family, or any finite number of heads, is what makes this event 
non-measurable. If we allowed more than one head per family and 
gave everyone a 50% chance, independent of all else, of being a head 
of the family, then we would have no trouble measuring the proba- 
bility of this event. Or if we let everyone in the top half of the circle 
be a family head, and again let families have more than one head, 
the answer would be easy. Later we will give a careful description of 
what types of events we can actually compute probabilities for. 

Being allowed to choose exactly one family head from each fam- 
ily requires a special mathematical assumption called the axiom of 
choice. This axiom famously can create all sorts of other logical 
mayhem, such as allowing you to break a sphere into a finite number 
of pieces and rearrange them into two spheres of the same size (the 
“Banach-Tarski paradox”). For this reason the axiom is controversial
and has been the subject of much study by mathematicians. 


1.3 Countable and Uncountable Sets


You may now be asking yourself if the existence of a uniform random
variable $X \sim U(0,1)$ also contradicts the laws of probability. We
know that for all $x$, $P(X = x) = 0$, but also $P(0 \le X \le 1) = 1$.
Doesn’t this give a contradiction because

$$P(0 \le X \le 1) = \sum_{x \in [0,1]} P(X = x) = 0?$$

Actually, this is not a contradiction because a summation over an
interval of real numbers does not make any sense. Which values of $x$
would you use for the first few terms in the sum? The first term in
the sum could use $x = 0$, but it’s difficult to decide on which value
of $x$ to use next.


In fact, infinite sums are defined in terms of a sequence of finite
sums:

$$\sum_{i=1}^{\infty} x_i = \lim_{n\to\infty} \sum_{i=1}^{n} x_i,$$


and so to have an infinite sum it must be possible to arrange the 
terms in a sequence. If an infinite set of items can be arranged in a 
sequence it is called countable, otherwise it is called uncountable. 

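Obviously the integers are countable, using the sequence
$0, -1, +1, -2, +2, \ldots$. The positive rational numbers are also
countable if you express them as a ratio of integers and list them in
order by the sum of these integers:

$$\frac{1}{1},\ \frac{2}{1},\ \frac{1}{2},\ \frac{3}{1},\ \frac{2}{2},\ \frac{1}{3},\ \frac{4}{1},\ \frac{3}{2},\ \frac{2}{3},\ \frac{1}{4},\ \ldots$$
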
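The listing by numerator-plus-denominator can be sketched in a few lines of Python. This is only an illustration: the function name `positive_rationals` is hypothetical, and unlike the list in the text this sketch skips repeated values such as 2/2, so that every positive rational appears exactly once.

```python
from fractions import Fraction
from itertools import islice


def positive_rationals():
    """Yield each positive rational once, ordered by numerator + denominator."""
    seen = set()
    total = 2  # current value of numerator + denominator
    while True:
        # Numerators run from total-1 down to 1: 1/1, 2/1, 1/2, 3/1, ...
        for num in range(total - 1, 0, -1):
            q = Fraction(num, total - num)
            if q not in seen:  # skip repeats such as 2/2 = 1/1
                seen.add(q)
                yield q
        total += 1


first_eight = list(islice(positive_rationals(), 8))
print(first_eight)
```

Because every positive rational has some finite numerator-plus-denominator, it is reached after finitely many steps, which is exactly what being arranged in a sequence requires.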
The real numbers between zero and one, however, are not countable.
Here we will explain why. Suppose somebody thinks they have
a method of arranging them into a sequence $x_1, x_2, \ldots$, where we
express them as $x_i = \sum_{j=1}^{\infty} d_{ij} 10^{-j}$, so that
$d_{ij} \in \{0,1,2,\ldots,9\}$ is the $j$th digit after the decimal place
of the $i$th number in their sequence. Then you can clearly see that the
number

$$y = \sum_{i=1}^{\infty} \bigl(1 + I\{d_{ii} = 1\}\bigr) 10^{-i},$$

where $I\{A\}$ equals 1 if $A$ is true and 0 otherwise, is nowhere to be
found in their sequence. This is because $y$ differs from $x_i$ in at least
the $i$th decimal place, and so it is different from every number in their
sequence. Whenever someone tries to arrange the real numbers into 
a sequence, this shows that they will always be omitting some of 
the numbers. This proves that the real numbers in any interval are 
uncountable, and that you can’t take a sum over all of them. 
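The diagonal construction can be tried out on a finite table of digits. The sketch below is illustrative only: `diagonal_number` and the sample rows are made-up names and data, and a finite table can only show the mechanism, not the full argument.

```python
def diagonal_number(digit_rows):
    """digit_rows[i][j] is the (j+1)th decimal digit of the (i+1)th listed number.

    Returns the first len(digit_rows) digits of y, where the ith digit of y
    is 2 if the ith digit of the ith number is 1, and 1 otherwise.
    """
    return [1 + (row[i] == 1) for i, row in enumerate(digit_rows)]


# Hypothetical start of someone's list of numbers in (0, 1).
rows = [
    [1, 4, 1, 5],  # 0.1415...
    [7, 1, 8, 2],  # 0.7182...
    [3, 3, 3, 3],  # 0.3333...
    [0, 0, 0, 1],  # 0.0001...
]
y_digits = diagonal_number(rows)
print(y_digits)  # digits of y after the decimal point
```

Since the digits of y are always 1 or 2, y has only one decimal expansion, so differing from each listed number in one digit really does make it a different number.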

So it’s true with $X \sim U(0,1)$ that for any countable set $A$ we
have $P(X \in A) = \sum_{x \in A} P(X = x) = 0$, but we can’t simply sum
up the probabilities like this for an uncountable set. There are,
however, some examples of uncountable sets $A$ (the Cantor set, for
example) which have $P(X \in A) = 0$.
1.4 Probability Spaces


Let $\Omega$ be the set of points in a sample space, and let $\mathcal{F}$ be the
collection of subsets of $\Omega$ for which we can calculate a probability.
These subsets are called events, and can be viewed as possible things
which could happen. If we let $P$ be the function which gives the
probability for any event in $\mathcal{F}$, then the triple $(\Omega, \mathcal{F}, P)$ is called a
probability space. The collection $\mathcal{F}$ is usually what is called a sigma
field (also called a sigma algebra), which we define next.


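Definition 1.1 The collection of sets $\mathcal{F}$ is a sigma field, or a $\sigma$-field,
if it has the following three properties:

1. $\Omega \in \mathcal{F}$
2. $A \in \mathcal{F} \Rightarrow A^c \in \mathcal{F}$
3. $A_1, A_2, \ldots \in \mathcal{F} \Rightarrow \bigcup_{i=1}^{\infty} A_i \in \mathcal{F}$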


These properties say you can calculate the probability of the
whole sample space (property 1), the complement of any event
(property 2), and the countable union of any sequence of events (property
3). These also imply that you can calculate the probability of the
countable intersection of any sequence of events, since
$\bigcap_{i=1}^{\infty} A_i = \bigl(\bigcup_{i=1}^{\infty} A_i^c\bigr)^c$.

To specify a $\sigma$-field, people typically start with a collection of
events $\mathcal{A}$ and write $\sigma(\mathcal{A})$ to represent the smallest $\sigma$-field containing
the collection of events $\mathcal{A}$. Thus $\sigma(\mathcal{A})$ is called the $\sigma$-field “generated”
by $\mathcal{A}$. It is uniquely defined as the intersection of all possible sigma
fields which contain $\mathcal{A}$, and in exercise 2 below you will show such
an intersection is always a sigma field.


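Example 1.2 Let $\Omega = \{a,b,c\}$ be the sample space and let $\mathcal{A} =
\{\{a,b\}, \{c\}\}$. Then $\mathcal{A}$ is not a $\sigma$-field because $\{a,b,c\} \notin \mathcal{A}$, but
$\sigma(\mathcal{A}) = \{\{a,b,c\}, \{a,b\}, \{c\}, \emptyset\}$, where $\emptyset = \Omega^c$ is the empty set.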
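On a finite sample space the generated sigma field can be computed by brute force: keep adding complements and unions until nothing new appears. The sketch below assumes a finite sample space, where closing under pairwise unions is enough; `generate_sigma_field` is a hypothetical helper name, not notation from the text.

```python
from itertools import combinations


def generate_sigma_field(omega, collection):
    """Smallest sigma field on the finite set omega containing the given sets."""
    omega = frozenset(omega)
    field = {omega, frozenset()} | {frozenset(s) for s in collection}
    changed = True
    while changed:
        changed = False
        current = list(field)
        # Close under complements.
        for s in current:
            if omega - s not in field:
                field.add(omega - s)
                changed = True
        # Close under pairwise unions (enough when omega is finite).
        for s, t in combinations(current, 2):
            if s | t not in field:
                field.add(s | t)
                changed = True
    return field


sigma = generate_sigma_field("abc", [{"a", "b"}, {"c"}])
print(sorted("".join(sorted(s)) for s in sigma))
```

For the collection {{a,b},{c}} this returns the four sets listed in Example 1.2; starting instead from {{a},{b}} yields all eight subsets of the sample space.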


Definition 1.3 A probability measure $P$ is a function, defined on the
sets in a sigma field, which has the following three properties:

1. $P(\Omega) = 1$, and
2. $P(A) \ge 0$, and
3. $P\bigl(\bigcup_{i=1}^{\infty} A_i\bigr) = \sum_{i=1}^{\infty} P(A_i)$ if for all $i \ne j$ we have $A_i \cap A_j = \emptyset$.


These imply that probabilities must be between zero and one, and 
say that the probability of a countable union of mutually exclusive 
events is the sum of the probabilities. 


Example 1.4 Dice. If you roll a pair of dice, the 36 points in the
sample space are $\Omega = \{(1,1), (1,2), \ldots, (5,6), (6,6)\}$. We can let $\mathcal{F}$ be
the collection of all possible subsets of $\Omega$, and it’s easy to see that it
is a sigma field. Then we can define

$$P(A) = \frac{|A|}{36},$$

where $|A|$ is the number of sample space points in $A$. Thus if $A =
\{(1,1), (3,2)\}$, then $P(A) = 2/36$, and it’s easy to see that $P$ is a
probability measure.
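The dice example is small enough to check by enumeration. The following sketch assumes the counting measure described above (each of the 36 points equally likely) and uses exact fractions; the names `omega` and `P` are chosen here for illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space for a pair of dice: all 36 ordered pairs.
omega = set(product(range(1, 7), repeat=2))


def P(event):
    """Counting measure: number of sample points in the event divided by 36."""
    return Fraction(len(set(event) & omega), len(omega))


A = {(1, 1), (3, 2)}
B = {(6, 6)}
print(P(A), P(omega), P(A | B) == P(A) + P(B))
```

The last comparison is additivity for two mutually exclusive events, one instance of the third property in the definition of a probability measure.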


Example 1.5 The unit interval. Suppose we want to pick a uniform
random number between zero and one. Then the sample space equals
$\Omega = [0,1]$, the set of all real numbers between zero and one. We can
let $\mathcal{F}$ be the collection of all possible subsets of $\Omega$, and it’s easy to
see that it is a sigma field. But it turns out that it’s not possible
to put a probability measure on this sigma field. Since one of the
sets in $\mathcal{F}$ would be similar to the set of heads of the family (from
the non-measurable event example), this event cannot have a
probability assigned to it. So this sigma field is not a good one to use in
probability.


Example 1.6 The unit interval again. Again with $\Omega = [0,1]$, suppose
we use the sigma field $\mathcal{F} = \sigma(\{x\}_{x \in \Omega})$, the smallest sigma field
generated by all possible sets containing a single real number. This
is a nice enough sigma field, but it would never be possible to find
the probability for some interval, such as $[0.2, 0.4]$. You can’t take
a countable union of single real numbers and expect to get an
uncountable interval somehow. So this is not a good sigma field to
use.




So if we want to put a probability measure on the real numbers
between zero and one, what sigma field can we use? The answer is
the Borel sigma field $\mathcal{B}$, the smallest sigma field generated by all
intervals of the form $[x,y)$ of real numbers between zero and one:
$\mathcal{B} = \sigma([x,y)_{x<y \in \Omega})$. The sets in this sigma field are called Borel sets.
We will see that most reasonable sets you would be interested in are
Borel sets, though sets similar to the one in the “heads of the family”
example are not Borel sets.


We can then use the special probability measure, which is called
Lebesgue measure (named after the French mathematician Henri
Lebesgue), defined by $P([x,y)) = y - x$, for $0 \le x \le y \le 1$, to
give us a uniform distribution. Defining it for just these intervals is
enough to uniquely specify the probability of every set in $\mathcal{B}$ (this
fact can be shown to follow from the extension theorem, Theorem
1.62, which is discussed later). And actually, you can do almost all
of probability starting from just a uniform(0,1) random variable, so
this probability measure is pretty much all you need.


Example 1.7 If $\mathcal{B}$ is the Borel sigma field on $[0,1]$, is $\{.5\} \in \mathcal{B}$? Yes
because $\{.5\} = \bigcap_{n=1}^{\infty} [.5, .5 + 1/n)$. Also note that $\{1\} = [0,1)^c \in \mathcal{B}$.


Example 1.8 If $\mathcal{B}$ is the Borel sigma field on $[0,1]$, is the set of
rational numbers between zero and one $Q \in \mathcal{B}$? The argument from
the previous example shows $\{x\} \in \mathcal{B}$ for all $x$, so that each number
by itself is a Borel set, and we then get $Q \in \mathcal{B}$ since $Q$ is a countable
union of such numbers. Also note that this then means $Q^c \in \mathcal{B}$, so
that the set of irrational numbers is also a Borel set.


Actually there are some Borel sets which can’t directly be written 
as a countable intersection or union of intervals like the preceding, 
but you usually don’t run into them. 


From the definition of probability we can derive many of the 
famous formulas you may have seen before such as 


$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$

and, extending this by induction,

$$P\Bigl(\bigcup_{i=1}^{n} A_i\Bigr) = \sum_{i} P(A_i) - \sum_{i<j} P(A_i \cap A_j) + \sum_{i<j<k} P(A_i \cap A_j \cap A_k) - \cdots + (-1)^{n+1} P(A_1 \cap A_2 \cap \cdots \cap A_n),$$


where the last formula is usually called the inclusion-exclusion
formula. Next we give a couple of examples applying these. In
these examples the sample space is finite, and in such cases unless
otherwise specified we assume the corresponding sigma field is the
set of all possible subsets of the sample space.


Example 1.9 Cards. A deck of $n$ cards is well shuffled many times.
(a) What’s the probability the cards all get back to their initial
positions? (b) What’s the probability at least one card is back in its
initial position?


Solution. Since there are $n!$ different orderings for the cards and
all are approximately equally likely after shuffling, the answer to
part (a) is approximately $1/n!$. For the answer to part (b), let
$A_i = \{$card $i$ is back in its initial position$\}$ and let $A = \bigcup_{i=1}^{n} A_i$
be the event that at least one card is back in its initial position. Because
$P(A_{i_1} \cap A_{i_2} \cap \cdots \cap A_{i_k}) = (n-k)!/n!$, and because the number of
terms in the $k$th sum of the inclusion-exclusion formula is $\binom{n}{k}$, we
have

$$P(A) = \sum_{k=1}^{n} (-1)^{k+1} \binom{n}{k} \frac{(n-k)!}{n!} = \sum_{k=1}^{n} \frac{(-1)^{k+1}}{k!} \approx 1 - 1/e$$

for large $n$. ■
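The inclusion-exclusion answer to part (b) can be checked against a direct count over all orderings for small decks. This is a sketch under the example's idealization that all $n!$ orderings are equally likely; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import permutations
from math import comb, exp, factorial


def p_match_formula(n):
    """Inclusion-exclusion: sum over k of (-1)^(k+1) * C(n,k) * (n-k)!/n!."""
    return sum(
        Fraction((-1) ** (k + 1) * comb(n, k) * factorial(n - k), factorial(n))
        for k in range(1, n + 1)
    )


def p_match_bruteforce(n):
    """Fraction of the n! orderings with at least one card in its initial position."""
    perms = list(permutations(range(n)))
    hits = sum(any(p[i] == i for i in range(n)) for p in perms)
    return Fraction(hits, len(perms))


for n in range(1, 7):
    print(n, p_match_formula(n), p_match_formula(n) == p_match_bruteforce(n))
print(float(p_match_formula(10)), 1 - exp(-1))
```

The last line compares the formula at $n = 10$ with $1 - 1/e$; the two agree to several decimal places because the series converges very quickly.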


Example 1.10 Coins. If a fair coin is flipped $n$ times, what is the
chance of seeing at least $k$ heads in a row?
Solution. We will show you that the answer is

$$\sum_{m=1}^{\lfloor (n+1)/(k+1) \rfloor} (-1)^{m+1} \left[ \binom{n-mk}{m} 2^{-m(k+1)} + \binom{n-mk}{m-1} 2^{-m(k+1)+1} \right].$$

When we define the event $A_i = \{$a run of a tail immediately followed
by $k$ heads in a row starts at flip $i\}$, and $A_0 = \{$the first $k$ flips are
heads$\}$, we can use the inclusion-exclusion formula to get the answer
above because

$$P(\text{at least } k \text{ heads in a row}) = P\Bigl(\bigcup_{i=0}^{n-k} A_i\Bigr)$$

and

$$P(A_{i_1} A_{i_2} \cdots A_{i_m}) = \begin{cases} 0 & \text{if flips for any events overlap} \\ 2^{-m(k+1)} & \text{otherwise and } i_1 > 0 \\ 2^{-m(k+1)+1} & \text{otherwise and } i_1 = 0 \end{cases}$$

and the number of sets of indices $i_1 < i_2 < \cdots < i_m$ where the runs
do not overlap equals $\binom{n-mk}{m}$ if $i_1 > 0$ (imagine the $k$ heads in each
of the $m$ runs are invisible, so this is the number of ways to arrange
$m$ tails in $n - mk$ visible flips) and $\binom{n-mk}{m-1}$ if $i_1 = 0$. ■
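As a sanity check, the run-probability formula can be compared with a direct enumeration of all $2^n$ flip sequences. The formula coded below is the one implied by the counting argument above: an alternating sum over $m$ of $\binom{n-mk}{m} 2^{-m(k+1)} + \binom{n-mk}{m-1} 2^{-m(k+1)+1}$, for $m$ from 1 to $\lfloor (n+1)/(k+1) \rfloor$. The function names are hypothetical, and the sketch assumes $1 \le k \le n$.

```python
from fractions import Fraction
from itertools import product
from math import comb


def p_run_formula(n, k):
    """P(at least k heads in a row in n fair flips) via inclusion-exclusion."""
    total = Fraction(0)
    for m in range(1, (n + 1) // (k + 1) + 1):
        term = (
            comb(n - m * k, m) * Fraction(1, 2 ** (m * (k + 1)))
            + comb(n - m * k, m - 1) * Fraction(1, 2 ** (m * (k + 1) - 1))
        )
        total += (-1) ** (m + 1) * term
    return total


def p_run_bruteforce(n, k):
    """Enumerate all 2^n sequences and count those containing k heads in a row."""
    hits = sum("H" * k in "".join(flips) for flips in product("HT", repeat=n))
    return Fraction(hits, 2 ** n)


print(p_run_formula(5, 2), p_run_bruteforce(5, 2))
```

Exact fractions are used so that agreement between the two functions is an equality, not a floating-point coincidence.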


An important property of the probability function is that it is a
continuous function on the events of the sample space $\Omega$. To make
this precise let $A_n, n \ge 1$, be a sequence of events, and define the
event $\liminf A_n$ as

$$\liminf A_n = \bigcup_{n=1}^{\infty} \bigcap_{i=n}^{\infty} A_i.$$

Because $\liminf A_n$ consists of all outcomes of the sample space that
are contained in $\bigcap_{i=n}^{\infty} A_i$ for some $n$, it follows that $\liminf A_n$ consists
of all outcomes that are contained in all but a finite number of the
events $A_n, n \ge 1$.


Similarly, the event $\limsup A_n$ is defined by

$$\limsup A_n = \bigcap_{n=1}^{\infty} \bigcup_{i=n}^{\infty} A_i.$$

Because $\limsup A_n$ consists of all outcomes of the sample space that
are contained in $\bigcup_{i=n}^{\infty} A_i$ for all $n$, it follows that $\limsup A_n$ consists of
all outcomes that are contained in an infinite number of the events
$A_n, n \ge 1$. Sometimes the notation $\{A_n \text{ i.o.}\}$ is used to represent
$\limsup A_n$, where the “i.o.” stands for “infinitely often” and it means
that an infinite number of the events $A_n$ occur.


Note that by their definitions

$$\liminf A_n \subset \limsup A_n.$$

Definition 1.11 If $\limsup A_n = \liminf A_n$, we say that $\lim A_n$ exists
and define it by

$$\lim_n A_n = \limsup A_n = \liminf A_n.$$


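Example 1.12 (a) Suppose that $A_n, n \ge 1$, is an increasing sequence
of events, in that $A_n \subset A_{n+1}$, $n \ge 1$. Then $\bigcap_{i=n}^{\infty} A_i = A_n$, showing
that

$$\liminf A_n = \bigcup_{n=1}^{\infty} A_n.$$

Also, $\bigcup_{i=n}^{\infty} A_i = \bigcup_{i=1}^{\infty} A_i$, showing that

$$\limsup A_n = \bigcup_{n=1}^{\infty} A_n.$$

Hence,

$$\lim_n A_n = \bigcup_{n=1}^{\infty} A_n.$$

(b) If $A_n, n \ge 1$, is a decreasing sequence of events, in that $A_{n+1} \subset
A_n$, $n \ge 1$, then it similarly follows that

$$\lim_n A_n = \bigcap_{n=1}^{\infty} A_n.$$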

The following result is known as the continuity property of prob- 
abilities. 


Proposition 1.13 If $\lim_n A_n = A$, then $\lim_n P(A_n) = P(A)$.


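Proof: We prove it first for when $A_n$ is either an increasing or
decreasing sequence of events. Suppose $A_n \subset A_{n+1}$, $n \ge 1$. Then, with
$A_0$ defined to be the empty set,

$$\begin{aligned}
P(\lim A_n) &= P\Bigl(\bigcup_{i=1}^{\infty} A_i\Bigr) \\
&= P\Bigl(\bigcup_{i=1}^{\infty} A_i \bigl(\bigcup_{j=1}^{i-1} A_j\bigr)^c\Bigr) \\
&= P\Bigl(\bigcup_{i=1}^{\infty} A_i A_{i-1}^c\Bigr) \\
&= \sum_{i=1}^{\infty} P(A_i A_{i-1}^c) \\
&= \lim_{n\to\infty} \sum_{i=1}^{n} P(A_i A_{i-1}^c) \\
&= \lim_{n\to\infty} P\Bigl(\bigcup_{i=1}^{n} A_i A_{i-1}^c\Bigr) \\
&= \lim_{n\to\infty} P\Bigl(\bigcup_{i=1}^{n} A_i\Bigr) \\
&= \lim_{n\to\infty} P(A_n).
\end{aligned}$$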


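Now, suppose that $A_{n+1} \subset A_n$, $n \ge 1$. Because $A_n^c$ is an increasing
sequence of events, the preceding implies that

$$P\Bigl(\bigcup_{i=1}^{\infty} A_i^c\Bigr) = \lim_{n\to\infty} P(A_n^c)$$

or, equivalently

$$P\Bigl(\bigl(\bigcap_{i=1}^{\infty} A_i\bigr)^c\Bigr) = 1 - \lim_{n\to\infty} P(A_n)$$

or

$$P\Bigl(\bigcap_{i=1}^{\infty} A_i\Bigr) = \lim_{n\to\infty} P(A_n)$$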


which completes the proof whenever $A_n$ is a monotone sequence.
Now, consider the general case, and let $B_n = \bigcup_{i=n}^{\infty} A_i$. Noting that
$B_{n+1} \subset B_n$, and applying the preceding yields

$$P(\limsup A_n) = P\Bigl(\bigcap_{n=1}^{\infty} B_n\Bigr) = \lim_{n\to\infty} P(B_n). \qquad (1.1)$$

Also, with $C_n = \bigcap_{i=n}^{\infty} A_i$,

$$P(\liminf A_n) = P\Bigl(\bigcup_{n=1}^{\infty} C_n\Bigr) = \lim_{n\to\infty} P(C_n) \qquad (1.2)$$

because $C_n \subset C_{n+1}$. But

$$C_n = \bigcap_{i=n}^{\infty} A_i \subset A_n \subset \bigcup_{i=n}^{\infty} A_i = B_n,$$

showing that

$$P(C_n) \le P(A_n) \le P(B_n). \qquad (1.3)$$

Thus, if $\liminf A_n = \limsup A_n = \lim A_n$, then we obtain from (1.1)
and (1.2) that the upper and lower bounds of (1.3) converge to each
other in the limit, and this proves the result. ■


1.5 Random Variables 


Suppose you have a function $X$ which assigns a real number to each
point in the sample space $\Omega$, and you also have a sigma field $\mathcal{F}$. We
say that $X$ is an $\mathcal{F}$-measurable random variable if you can compute
its entire cumulative distribution function using probabilities of
events in $\mathcal{F}$ or, equivalently, that you would know the value of $X$ if
you were told which events in $\mathcal{F}$ actually happen. We define the
notation $\{X \le x\} = \{\omega \in \Omega : X(\omega) \le x\}$, so that $X$ is $\mathcal{F}$-measurable
if $\{X \le x\} \in \mathcal{F}$ for all $x$. This is often written in shorthand notation
as $X \in \mathcal{F}$.


Example 1.14 $\Omega = \{a,b,c\}$, $\mathcal{A} = \{\{a,b,c\}, \{a,b\}, \{c\}, \emptyset\}$, and we
define three random variables $X, Y, Z$ as follows:

    ω    X    Y    Z
    a    1    1    1
    b    1    2    7
    c    2    2    4

Which of the random variables $X$, $Y$, and $Z$ are $\mathcal{A}$-measurable?
Well since $\{Y \le 1\} = \{a\} \notin \mathcal{A}$, then $Y$ is not $\mathcal{A}$-measurable. For
the same reason, $Z$ is not $\mathcal{A}$-measurable. The variable $X$ is
$\mathcal{A}$-measurable because $\{X \le 1\} = \{a,b\} \in \mathcal{A}$, and $\{X \le 2\} =
\{a,b,c\} \in \mathcal{A}$. In other words, you can always figure out the value of
$X$ using just the events in $\mathcal{A}$, but you can’t always figure out the
values of $Y$ and $Z$.
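On a finite sample space, measurability can be checked mechanically by testing whether each set $\{X \le x\}$ belongs to the sigma field. The sketch below uses values chosen to reproduce the sets quoted in the example: $X(a)=X(b)=1$, $X(c)=2$; $Z(a)=1$, $Z(c)=4$, $Z(b)=7$. The value 2 for $Y(b)=Y(c)$ is an assumption, since any common value above 1 gives the same events. The helper `is_measurable` is a hypothetical name.

```python
def is_measurable(rv, sigma_field):
    """Check that {rv <= x} is in sigma_field for every value x that rv attains.

    rv maps each sample point to a number. On a finite space it suffices to
    test the attained values, because {rv <= x} only changes at those values.
    """
    return all(
        frozenset(w for w in rv if rv[w] <= x) in sigma_field
        for x in set(rv.values())
    )


A = {frozenset("abc"), frozenset("ab"), frozenset("c"), frozenset()}
X = {"a": 1, "b": 1, "c": 2}
Y = {"a": 1, "b": 2, "c": 2}  # value 2 on b and c is an assumed choice
Z = {"a": 1, "b": 7, "c": 4}

print(is_measurable(X, A), is_measurable(Y, A), is_measurable(Z, A))
```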
Definition 1.15 For a random variable $X$ we define $\sigma(X)$ to be the
smallest sigma field which makes $X$ measurable with respect to $\sigma(X)$.
We read it as “the sigma field generated by $X$.”


Definition 1.16 For random variables $X, Y$ we say that $X$ is
$Y$-measurable if $X \in \sigma(Y)$.


Example 1.17 In the previous example, is $Y \in \sigma(Z)$? Yes, because
$\sigma(Z) = \{\{a,b,c\}, \{a\}, \{a,b\}, \{b\}, \{b,c\}, \{c\}, \{c,a\}, \emptyset\}$, the set of all
possible subsets of $\Omega$. Is $X \in \sigma(Y)$? No, since $\{X \le 1\} = \{a,b\} \notin
\sigma(Y) = \{\{a,b,c\}, \{b,c\}, \{a\}, \emptyset\}$.

To see why $\sigma(Z)$ is as given, note that $\{Z \le 1\} = \{a\}$, $\{Z \le 4\} =
\{a,c\}$, $\{Z \le 7\} = \{a,b,c\}$, $\{a\}^c = \{b,c\}$, $\{a,b\}^c = \{c\}$, $\{a\} \cup \{c\} =
\{a,c\}$, $\{a,b,c\}^c = \emptyset$, and $\{a,c\}^c = \{b\}$.


Example 1.18 Suppose $X$ and $Y$ are random variables taking values
between zero and one, and are measurable with respect to the Borel
sigma field $\mathcal{B}$. Is $Z = X + Y$ also measurable with respect to $\mathcal{B}$?
Well, we must show that $\{Z \le z\} \in \mathcal{B}$ for all $z$. We can write

$$\{X + Y > z\} = \bigcup_{q \in Q} \bigl(\{X > q\} \cap \{Y > z - q\}\bigr),$$

where $Q$ is the set of rational numbers. Since $\{X > q\} \in \mathcal{B}$, $\{Y >
z - q\} \in \mathcal{B}$, and $Q$ is countable, this means that $\{X + Y \le z\} =
\{X + Y > z\}^c \in \mathcal{B}$ and thus $Z$ is measurable with respect to $\mathcal{B}$.


Example 1.19 The function $F(x) = P(X \le x)$ is called the distribution
function of the random variable $X$. If $x_n \downarrow x$ then the sequence
of events $A_n = \{X \le x_n\}$, $n \ge 1$, is a decreasing sequence whose
limit is

$$\lim_n A_n = \bigcap_n A_n = \{X \le x\}.$$

Consequently, the continuity property of probabilities yields that

$$F(x) = \lim_{n\to\infty} F(x_n),$$

showing that a distribution function is always right continuous. On
the other hand, if $x_n \uparrow x$, then the sequence of events $A_n = \{X \le
x_n\}$, $n \ge 1$, is an increasing sequence, implying that

$$\lim_{n\to\infty} F(x_n) = P\Bigl(\bigcup_n A_n\Bigr) = P(X < x) = F(x) - P(X = x).$$

Two events are independent if knowing that one occurs does not
change the chance that the other occurs. This is formalized in the
following definition.


Definition 1.20 Sigma fields $\mathcal{F}_1, \ldots, \mathcal{F}_n$ are independent if whenever
$A_i \in \mathcal{F}_i$ for $i = 1, \ldots, n$, we have $P\bigl(\bigcap_{i=1}^{n} A_i\bigr) = \prod_{i=1}^{n} P(A_i)$.


Using this we say that random variables $X_1, \ldots, X_n$ are independent
if the sigma fields $\sigma(X_1), \ldots, \sigma(X_n)$ are independent, and we say
events $A_1, \ldots, A_n$ are independent if $I_{A_1}, \ldots, I_{A_n}$ are independent
random variables.


Remark 1.21 One interesting property of independence is that it’s
possible that events $A, B, C$ are not independent even if each pair of
the events are independent. For example, suppose we make three
independent flips of a fair coin and let $A$ represent the event that exactly 1 head
comes up in the first two flips, let $B$ represent the event that exactly 1
head comes up in the last two flips, and let $C$ represent the event that
exactly 1 head comes up among the first and last flip. Then each
event has probability 1/2, the intersection of each pair of events has
probability 1/4, but we have $P(ABC) = 0$.
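The three-coin example can be verified by listing all eight equally likely outcomes. The sketch below does exactly that with exact fractions; `prob` is a hypothetical helper that counts the outcomes satisfying a predicate.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product("HT", repeat=3))  # 8 equally likely outcomes


def prob(event):
    """Probability of an event given as a predicate on an outcome."""
    return Fraction(sum(1 for w in outcomes if event(w)), len(outcomes))


def A(w):  # exactly one head in the first two flips
    return (w[0] == "H") + (w[1] == "H") == 1


def B(w):  # exactly one head in the last two flips
    return (w[1] == "H") + (w[2] == "H") == 1


def C(w):  # exactly one head among the first and last flips
    return (w[0] == "H") + (w[2] == "H") == 1


print(prob(A), prob(B), prob(C))
print(prob(lambda w: A(w) and B(w)), prob(lambda w: A(w) and B(w) and C(w)))
```

All three events occurring would require the three flips to be pairwise different, which is impossible with only two faces; that is why the triple intersection is empty.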


In our next example we derive a formula for the distribution of 
the convolution of geometric random variables. 


Example 1.22 Suppose we have $n$ coins that we toss in sequence,
moving from one coin to the next in line each time a head appears.
That is, we continue using a coin until it lands heads, and then we
switch to the next one. Let $X_i$ denote the number of flips made
with coin $i$. Assuming that all coin flips are independent and that
each lands heads with probability $p$, we know from our first course in
probability that $X_i$ is a geometric random variable with parameter $p$,
and also that the total number of flips made has a negative binomial
distribution with probability mass function

$$P(X_1 + \cdots + X_n = m) = \binom{m-1}{n-1} p^n (1-p)^{m-n}, \quad m \ge n.$$

The probability mass function of the total number of flips when each
coin has a different probability of landing heads is easily obtained
using the following proposition.


Proposition 1.23 If $X_1, \ldots, X_n$ are independent geometric random
variables with parameters $p_1, \ldots, p_n$, where $p_i \ne p_j$ if $i \ne j$, then,
with $q_i = 1 - p_i$, for $k \ge n - 1$

$$P(X_1 + \cdots + X_n > k) = \sum_{i=1}^{n} q_i^k \prod_{j \ne i} \frac{p_j}{p_j - p_i}.$$


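Proof. We will prove $A_{k,n} = P(X_1 + \cdots + X_n > k)$ is as given
above using induction on $k + n$. Since clearly $A_{1,1} = q_1$, we will
assume as our induction hypothesis that $A_{i,j}$ is as given above for
all $i + j < k + n$. Then by conditioning on whether or not the event
$\{X_n > 1\}$ occurs we get

$$\begin{aligned}
A_{k,n} &= q_n A_{k-1,n} + p_n A_{k-1,n-1} \\
&= q_n \sum_{i=1}^{n} q_i^{k-1} \prod_{j \ne i} \frac{p_j}{p_j - p_i} + p_n \sum_{i=1}^{n-1} q_i^{k-1} \prod_{j \ne i,\, j < n} \frac{p_j}{p_j - p_i} \\
&= \sum_{i=1}^{n} q_i^{k} \prod_{j \ne i} \frac{p_j}{p_j - p_i},
\end{aligned}$$

which completes the proof by induction. ■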
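The tail formula of Proposition 1.23, $P(X_1 + \cdots + X_n > k) = \sum_i q_i^k \prod_{j \ne i} p_j/(p_j - p_i)$, can be checked numerically against a direct convolution of the geometric mass functions. The sketch below uses exact fractions and three arbitrarily chosen distinct parameters; the function names are hypothetical.

```python
from fractions import Fraction


def tail_formula(ps, k):
    """Sum over i of q_i^k times the product over j != i of p_j / (p_j - p_i)."""
    total = Fraction(0)
    for i, p_i in enumerate(ps):
        prod = Fraction(1)
        for j, p_j in enumerate(ps):
            if j != i:
                prod *= p_j / (p_j - p_i)
        total += (1 - p_i) ** k * prod
    return total


def tail_direct(ps, k):
    """P(X_1 + ... + X_n > k), convolving the geometric pmfs on sums up to k."""
    dist = {0: Fraction(1)}  # dist[s] = P(partial sum equals s), for s <= k
    for p in ps:
        new = {}
        for s, pr in dist.items():
            for x in range(1, k - s + 1):
                new[s + x] = new.get(s + x, 0) + pr * (1 - p) ** (x - 1) * p
        dist = new
    return 1 - sum(dist.values())


ps = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 5)]  # distinct parameters
for k in range(len(ps) - 1, 9):
    print(k, tail_formula(ps, k), tail_formula(ps, k) == tail_direct(ps, k))
```

Note the formula requires the parameters to be distinct, since otherwise a denominator $p_j - p_i$ vanishes.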


1.6 Expected Value 


A random variable $X$ is continuous if there is a function $f$, called
its density function, so that $P(X \le x) = \int_{-\infty}^{x} f(t)\,dt$ for all $x$. A
random variable is discrete if it can only take a countable number
of different values. In elementary textbooks you usually see two
separate definitions for expected value:

$$E[X] = \begin{cases} \sum_i x_i P(X = x_i) & \text{if } X \text{ is discrete} \\ \int x f(x)\,dx & \text{if } X \text{ is continuous with density } f. \end{cases}$$


But it’s possible to have a random variable which is neither
continuous nor discrete. For example, with $U \sim U(0,1)$, the variable
$X = U I_{U > .5}$ is neither continuous nor discrete. It’s also possible to
have a sequence of continuous random variables which converges to a
discrete random variable, or vice versa. For example, if $X_n = U/n$,
then each $X_n$ is a continuous random variable but $\lim_{n\to\infty} X_n$ is a
discrete random variable (which equals zero). This means it would
be better to have a single more general definition which covers all
types of random variables. We introduce this next.

A simple random variable is one which can take on only a finite
number of different possible values, and its expected value is defined
as above for discrete random variables. Using these, we next define
the expected value of a more general non-negative random variable.
We will later define it for general random variables $X$ by expressing it
as the difference of two nonnegative random variables $X = X^+ - X^-$,
where $x^+ = \max(0, x)$ and $x^- = \max(-x, 0)$.


Definition 1.24 If $X \ge 0$, then we define

$$E[X] = \sup_{\text{all simple variables } Y \le X} E[Y].$$


We write $Y \le X$ for random variables $X, Y$ to mean $P(Y \le
X) = 1$, and this is sometimes written as “$Y \le X$ almost surely”
and abbreviated “$Y \le X$ a.s.”. For example if $X$ is nonnegative
and $a \ge 0$ then $Y = a I_{X \ge a}$ is a simple random variable such that
$Y \le X$. And by taking a supremum over “all simple variables”
we of course mean the simple random variables must be measurable
with respect to some given sigma field. Given a nonnegative random
variable $X$, one concrete choice of simple variables is the sequence
$Y_n = \min(\lfloor 2^n X \rfloor / 2^n, n)$, where $\lfloor x \rfloor$ denotes the integer portion of $x$.
We ask you in exercise 17 at the end of the chapter to show that
$Y_n \uparrow X$ and $E[X] = \lim_n E[Y_n]$.
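To see the dyadic approximation $Y_n = \min(\lfloor 2^n X \rfloor / 2^n, n)$ at work, here is a small sketch with exact arithmetic. The helper names are hypothetical, and the expectation computed is for the specific assumed case $X \sim U(0,1)$, where $Y_n$ equals $j/2^n$ on an interval of probability $2^{-n}$.

```python
from fractions import Fraction
from math import floor


def dyadic_approx(x, n):
    """Y_n = min(floor(2^n x) / 2^n, n) for a nonnegative value x."""
    return min(Fraction(floor(2 ** n * x), 2 ** n), n)


def expected_dyadic_uniform(n):
    """E[Y_n] when X ~ U(0,1): Y_n = j/2^n with probability 2^-n, j = 0..2^n - 1."""
    return sum(Fraction(j, 2 ** n) * Fraction(1, 2 ** n) for j in range(2 ** n))


x = Fraction(1, 3)
print([dyadic_approx(x, n) for n in range(1, 6)])      # nondecreasing, below 1/3
print([expected_dyadic_uniform(n) for n in range(1, 6)])  # increases toward 1/2
```

For the uniform case the expectations work out to $(2^n - 1)/2^{n+1}$, which increases to $1/2 = E[X]$.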

Another consequence of the definition of expected value is that if
$Y \le X$, then $E[Y] \le E[X]$.


Example 1.25 Markov’s inequality. Suppose $X \ge 0$. Then, for any
$a > 0$ we have that $a I_{X \ge a} \le X$. Therefore, $E[a I_{X \ge a}] \le E[X]$ or,
equivalently,

$$P(X \ge a) \le E[X]/a,$$

which is known as Markov’s inequality.
Example 1.26 Chebyshev’s inequality. A consequence of Markov’s
inequality is that for $a > 0$

$$P(|X| \ge a) = P(X^2 \ge a^2) \le E[X^2]/a^2,$$

a result known as Chebyshev’s inequality.
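Both inequalities are easy to check exactly on a small discrete distribution. The sketch below uses a fair six-sided die as an assumed example (it is not from the text) and compares the tail probability $P(X \ge a)$ with the Markov bound $E[X]/a$ and the Chebyshev bound $E[X^2]/a^2$.

```python
from fractions import Fraction

# Assumed example distribution: a fair six-sided die.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}


def expect(g):
    """E[g(X)] for the discrete distribution above."""
    return sum(g(x) * p for x, p in pmf.items())


def tail(a):
    """P(X >= a)."""
    return sum(p for x, p in pmf.items() if x >= a)


for a in range(1, 8):
    markov = expect(lambda x: x) / a
    chebyshev = expect(lambda x: x * x) / a ** 2
    print(a, tail(a), markov, chebyshev)
```

Since the die is nonnegative, $|X| = X$ and both bounds apply to the same tail; neither bound is tight here, which is typical.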


Given any random variable $X \ge 0$ with $E[X] < \infty$, and any
$\epsilon > 0$, we can find a simple random variable $Y$ with $E[X] - \epsilon <
E[Y] \le E[X]$. Our definition of the expected value also gives what
is called the Lebesgue integral of $X$ with respect to the probability
measure $P$, and is sometimes denoted $E[X] = \int X\,dP$.

So far we have only defined the expected value of a nonnegative
random variable. For the general case we first define $X^+ = X I_{X \ge 0}$
and $X^- = -X I_{X < 0}$ so that we can define $E[X] = E[X^+] - E[X^-]$,
with the convention that $E[X]$ is undefined if $E[X^+] = E[X^-] = \infty$.


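Remark 1.27 The definition of expected value covers random variables
which are neither continuous nor discrete, but if $X$ is continuous
with density function $f$ it is equivalent to the familiar definition
$E[X] = \int x f(x)\,dx$. For example when $0 \le X \le 1$ the definition of
the Riemann integral in terms of Riemann sums implies, with $\lfloor x \rfloor$
denoting the integer portion of $x$,

$$\begin{aligned}
\int x f(x)\,dx &= \lim_{n\to\infty} \sum_{i=0}^{n-1} \int_{i/n}^{(i+1)/n} x f(x)\,dx \\
&\le \lim_{n\to\infty} \sum_{i=0}^{n-1} \frac{i+1}{n} P\Bigl(\frac{i}{n} \le X < \frac{i+1}{n}\Bigr) \\
&= \lim_{n\to\infty} \sum_{i=0}^{n-1} \frac{i}{n} P\Bigl(\frac{i}{n} \le X < \frac{i+1}{n}\Bigr) \\
&= \lim_{n\to\infty} E[\lfloor nX \rfloor / n] \\
&\le E[X],
\end{aligned}$$

where the last line follows because $\lfloor nX \rfloor / n \le X$ is a simple random
variable.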




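Using that the density function $g$ of $1 - X$ is $g(x) = f(1 - x)$, we
obtain

$$\begin{aligned}
1 - E[X] &= E[1 - X] \\
&\ge \int x f(1 - x)\,dx \\
&= \int_0^1 (1 - x) f(x)\,dx \\
&= 1 - \int x f(x)\,dx.
\end{aligned}$$

Together the two inequalities give $E[X] = \int x f(x)\,dx$.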


Remark 1.28 At this point you may think it might be possible to express any random variable as sums or mixtures of discrete and continuous random variables, but this is not true. Let $X \sim U(0,1)$ be a uniform random variable, and let $d_i \in \{0,1,2,\ldots,9\}$ be the $i$th digit in its decimal expansion so that $X = \sum_{i=1}^{\infty} d_i 10^{-i}$. The random variable $Y = \sum_{i=1}^{\infty} \min(1,d_i)10^{-i}$ is not discrete and has no intervals over which it is continuous. This variable $Y$ can take any value (between zero and one) having a decimal expansion which uses only the digits 0 and 1, a set of values $C$ called a Cantor set. Since $C$ contains no intervals, $Y$ is not continuous. And $Y$ is not discrete because $C$ is uncountable: every real number between zero and one, using its base two expansion, corresponds to a distinct infinite sequence of binary digits.

Another interesting fact about a Cantor set is that, although $C$ is uncountable, $P(X \in C) = 0$. Let $C_i$ be the set of real numbers between 0 and 1 which have a decimal expansion using only the digits 0 and 1 up to the $i$th decimal place. Then it’s easy to see that $P(X \in C_i) = .2^i$ and since $P(X \in C) \le P(X \in C_i) = .2^i$ for any $i$, we must have $P(X \in C) = 0$. The set $C$ is called an uncountable set having measure zero.
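The bound $P(X \in C_i) = .2^i$ can be checked by brute force for small $i$. The sketch below is only an illustration: the function name `prob_in_C_i` is hypothetical, and the check simply counts how many of the $10^i$ equally likely $i$-digit decimal prefixes use only the digits 0 and 1.

```python
from fractions import Fraction
from itertools import product


def prob_in_C_i(i: int) -> Fraction:
    """Exact P(X in C_i) for X ~ U(0,1): the fraction of i-digit decimal
    prefixes d_1 ... d_i that use only the digits 0 and 1."""
    allowed = sum(
        1
        for digits in product(range(10), repeat=i)
        if all(d <= 1 for d in digits)
    )
    return Fraction(allowed, 10 ** i)


if __name__ == "__main__":
    # The probabilities shrink geometrically, which is why P(X in C) = 0.
    for i in range(1, 6):
        print(i, prob_in_C_i(i), float(prob_in_C_i(i)))
```

Using exact fractions avoids any floating point doubt about whether the counts agree with $(1/5)^i$.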


Proposition 1.29 If $E|X|, E|Y| < \infty$ then
(a) $E[aX + b] = aE[X] + b$ for constants $a, b$,
(b) $E[X + Y] = E[X] + E[Y]$.


Proof. In this proof we assume $X, Y \ge 0$, $a > 0$, and $b = 0$. The general cases will follow using $E[X+Y] = E[X^+ + Y^+] - E[X^- + Y^-]$,

$$E[b + X] = \sup_{Y \le b+X} E[Y] = \sup_{Y \le X} E[b+Y] = b + \sup_{Y \le X} E[Y] = b + E[X],$$

and $-aX + b = a(-X) + b$.

For part (a) if $X$ is simple we have

$$E[aX] = \sum_x a x P(X = x) = aE[X],$$

and since for every simple variable $Z \le X$ there corresponds another simple variable $aZ \le aX$, and vice versa, we get

$$E[aX] = \sup_{aZ \le aX} E[aZ] = a \sup_{Z \le X} E[Z] = aE[X],$$

where the supremums are over simple random variables.
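For part (b) if $X, Y$ are simple we have

$$\begin{aligned}
E[X+Y] &= \sum_z z P(X+Y = z) \\
&= \sum_z z \sum_{x,y:\,x+y=z} P(X = x, Y = y) \\
&= \sum_z \sum_{x,y:\,x+y=z} (x+y) P(X = x, Y = y) \\
&= \sum_{x,y} (x+y) P(X = x, Y = y) \\
&= \sum_{x,y} x P(X = x, Y = y) + \sum_{x,y} y P(X = x, Y = y) \\
&= \sum_x x P(X = x) + \sum_y y P(Y = y) \\
&= E[X] + E[Y],
\end{aligned}$$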


and applying this in the second line below we get

$$\begin{aligned}
E[X] + E[Y] &= \sup_{A \le X,\, B \le Y} \big(E[A] + E[B]\big) \\
&= \sup_{A \le X,\, B \le Y} E[A + B] \\
&\le \sup_{A \le X+Y} E[A] \\
&= E[X + Y],
\end{aligned}$$

where the supremums are over simple random variables. We then use this inequality in the third line below of

$$\begin{aligned}
E[\min(X+Y, n)] &= 2n - E[2n - \min(X+Y, n)] \\
&\le 2n - E[n - \min(X,n) + n - \min(Y,n)] \\
&\le 2n - E[n - \min(X,n)] - E[n - \min(Y,n)] \\
&= E[\min(X,n)] + E[\min(Y,n)] \\
&\le E[X] + E[Y],
\end{aligned}$$

and we use (a) in the first and fourth lines and $\min(X+Y, n) \le \min(X,n) + \min(Y,n)$ in the second line.

This means for any given simple $Z \le X + Y$ we can pick $n$ larger than the maximum value of $Z$ so that $E[Z] \le E[\min(X+Y, n)] \le E[X] + E[Y]$, and taking the supremum over all simple $Z \le X + Y$ gives $E[X+Y] \le E[X] + E[Y]$ and the result is proved. ∎


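Proposition 1.30 If $X$ is a non-negative integer valued random variable, then

$$E[X] = \sum_{n=0}^{\infty} P(X > n).$$

Proof. Since $E[X] = p_1 + 2p_2 + 3p_3 + 4p_4 + \cdots$, with $p_n = P(X = n)$ (see problem 6), we re-write this as

$$\begin{array}{rl}
E[X] = & p_1 + p_2 + p_3 + p_4 + \cdots \\
& \phantom{p_1} + p_2 + p_3 + p_4 + \cdots \\
& \phantom{p_1 + p_2} + p_3 + p_4 + \cdots \\
& \phantom{p_1 + p_2 + p_3} + p_4 + \cdots \\
& \qquad \vdots
\end{array}$$

and notice that the columns respectively equal $p_1, 2p_2, 3p_3, \ldots$ while the rows respectively equal $P(X > 0), P(X > 1), P(X > 2), \ldots$. ∎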
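The row and column bookkeeping in this proof can be verified on any finite probability mass function. The following is a minimal sketch; the helper names `mean_direct` and `mean_tail_sum` and the example pmf are assumptions introduced only for illustration.

```python
from fractions import Fraction


def mean_direct(pmf):
    """E[X] computed column by column: sum of n * P(X = n)."""
    return sum(n * p for n, p in pmf.items())


def mean_tail_sum(pmf):
    """E[X] computed row by row: sum over n >= 0 of P(X > n).

    pmf maps non-negative integers to probabilities; P(X > n) = 0 once n
    reaches the largest value, so the sum can stop there.
    """
    top = max(pmf)
    return sum(
        sum(p for k, p in pmf.items() if k > n)
        for n in range(top)
    )


if __name__ == "__main__":
    # Number of heads in three fair coin flips.
    pmf = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}
    print(mean_direct(pmf), mean_tail_sum(pmf))
```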


Example 1.31 With $X_1, X_2, \ldots$ independent $U(0,1)$ random variables, compute the expected value of

$$N = \min\Big\{n : \sum_{i=1}^{n} X_i > 1\Big\}.$$

Solution. Using $E[N] = \sum_{n=0}^{\infty} P(N > n)$, and noting that

$$P(N > 0) = P(N > 1) = 1,$$

and

$$P(N > n) = \int_0^1 \int_0^{1-x_1} \int_0^{1-x_1-x_2} \cdots \int_0^{1-x_1-x_2-\cdots-x_{n-1}} dx_n \cdots dx_1 = 1/n!,$$

we get $E[N] = \sum_{n=0}^{\infty} 1/n! = e$. ∎
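The value $E[N] = e$ is easy to sanity-check by simulation. The sketch below is a Monte Carlo illustration under stated assumptions (the function names, sample size, and seed are arbitrary choices, not part of the text); it draws uniforms until their sum exceeds 1 and averages the number of draws.

```python
import math
import random


def sample_N(rng: random.Random) -> int:
    """Draw U(0,1) variables until their running sum exceeds 1; return the count."""
    total, n = 0.0, 0
    while total <= 1.0:
        total += rng.random()
        n += 1
    return n


def estimate_EN(num_samples: int = 200_000, seed: int = 0) -> float:
    """Monte Carlo estimate of E[N]."""
    rng = random.Random(seed)
    return sum(sample_N(rng) for _ in range(num_samples)) / num_samples


if __name__ == "__main__":
    print("estimate:", estimate_EN(), "  e:", math.e)
```

Since a single uniform never exceeds 1, every sample of $N$ is at least 2, matching $P(N > 1) = 1$.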


1.7 Almost Sure Convergence and the Dominated 
Convergence Theorem 


For a sequence of non-random real numbers we write $x_n \to x$ or $\lim_{n\to\infty} x_n = x$ if for any $\epsilon > 0$ there exists a value $n$ such that $|x_m - x| < \epsilon$ for all $m > n$. Intuitively this means eventually the sequence never leaves an arbitrarily small neighborhood around $x$. It doesn’t simply mean that you can always find terms in the sequence which are arbitrarily close to $x$, but rather it means that eventually all terms in the sequence become arbitrarily close to $x$. When $x_n \to \infty$, it means that for any $k > 0$ there exists a value $n$ such that $x_m > k$ for all $m > n$.

The sequence of random variables $X_n, n \ge 1$, is said to converge almost surely to the random variable $X$, written $X_n \to_{as} X$, or $\lim_{n\to\infty} X_n = X$ a.s., if with probability 1

$$\lim_{n} X_n = X.$$

The following proposition presents an alternative characterization 
of almost sure convergence. 


Proposition 1.32 $X_n \to_{as} X$ if and only if for any $\epsilon > 0$

$$P(|X_n - X| < \epsilon \text{ for all } n \ge m) \to 1 \quad \text{as } m \to \infty.$$


Proof. Suppose first that $X_n \to_{as} X$. Fix $\epsilon > 0$, and for $m \ge 1$, define the event

$$A_m = \{|X_n - X| < \epsilon \text{ for all } n \ge m\}.$$

Because $A_m, m \ge 1$, is an increasing sequence of events, the continuity property of probabilities yields that

$$\begin{aligned}
\lim_m P(A_m) &= P(\lim_m A_m) \\
&= P(|X_n - X| < \epsilon \text{ for all } n \text{ sufficiently large}) \\
&\ge P(\lim_n X_n = X) \\
&= 1.
\end{aligned}$$


To go the other way, assume that for any $\epsilon > 0$

$$P(|X_n - X| < \epsilon \text{ for all } n \ge m) \to 1 \quad \text{as } m \to \infty.$$

Let $\epsilon_i, i \ge 1$, be a decreasing sequence of positive numbers that converge to 0, and let

$$A_{m,i} = \{|X_n - X| < \epsilon_i \text{ for all } n \ge m\}.$$

Because $A_{m,i} \subset A_{m+1,i}$ and, by assumption, $\lim_m P(A_{m,i}) = 1$, it follows from the continuity property that

$$1 = P(\lim_{m\to\infty} A_{m,i}) = P(B_i)$$

where $B_i = \{|X_n - X| < \epsilon_i \text{ for all } n \text{ sufficiently large}\}$. But $B_i, i \ge 1$, is a decreasing sequence of events, so invoking the continuity property once again yields that

$$1 = \lim_{i\to\infty} P(B_i) = P(\lim_i B_i)$$

which proves the result since

$$\begin{aligned}
\lim_i B_i &= \{\text{for all } i,\ |X_n - X| < \epsilon_i \text{ for all } n \text{ sufficiently large}\} \\
&= \{\lim_n X_n = X\}. \qquad ∎
\end{aligned}$$


Remark 1.33 The reason for the word “almost” in “almost surely” is that $P(A) = 1$ doesn’t necessarily mean that $A^c$ is the empty set. For example if $X \sim U(0,1)$ we know that $P(X \ne 1/3) = 1$ even though $\{X = 1/3\}$ is a possible outcome.




The dominated convergence theorem is one of the fundamental 
building blocks of all limit theorems in probability. It tells you some- 
thing about what happens to the expected value of random variables 
in a sequence, if the random variables are converging almost surely. 
Many limit theorems in probability involve an almost surely converg- 
ing sequence, and being able to accurately say something about the 
expected value of the limiting random variable is important. 

Given a sequence of random variables $X_1, X_2, \ldots$, it may seem to you at first thought that $X_n \to X$ a.s. should imply $\lim_{n\to\infty} E[X_n] = E[X]$. This is sometimes called “interchanging limit and expectation,” since $E[X] = E[\lim_{n\to\infty} X_n]$. But this interchange is not always valid, and the next example illustrates this.


Example 1.34 Suppose $U \sim U(0,1)$ and $X_n = n I_{\{n < 1/U\}}$. Regardless of what $U$ turns out to be, as soon as $n$ gets larger than $1/U$ the terms $X_n$ in the sequence will all equal zero. This means $X_n \to 0$ a.s., but at the same time we have $E[X_n] = nP(U < 1/n) = n/n = 1$ for all $n$, and thus $\lim_{n\to\infty} E[X_n] = 1$. Interchanging limit and expectation is not valid in this case.
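Both halves of this example can be seen numerically. In the sketch below (function names, sample size, and seed are illustrative assumptions), `X(n, u)` evaluates $X_n$ on a fixed outcome $u$ of $U$, so a single path can be followed to zero, while `mean_X` estimates $E[X_n]$ by Monte Carlo.

```python
import random


def X(n: int, u: float) -> int:
    """X_n = n * I{u < 1/n}, evaluated at a fixed outcome u of U."""
    return n if u < 1.0 / n else 0


def mean_X(n: int, num_samples: int = 200_000, seed: int = 0) -> float:
    """Monte Carlo estimate of E[X_n]; the exact value is 1 for every n."""
    rng = random.Random(seed)
    return sum(X(n, rng.random()) for _ in range(num_samples)) / num_samples


if __name__ == "__main__":
    # One path: the sequence is zero as soon as n exceeds 1/u.
    print([X(n, 0.3) for n in range(1, 9)])
    # The mean does not follow the path to zero.
    print([round(mean_X(n), 3) for n in (1, 10, 50)])
```

The path is eventually zero for every $u > 0$, yet the rare large values keep the mean pinned at 1.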


What’s going wrong here? In this case $X_n$ can increase beyond any level as $n$ gets larger and larger, and this can cause problems with the expected value. The dominated convergence theorem says that if $X_n$ is always bounded in absolute value by some other random variable with finite mean, then we can interchange limit and expectation. We will first state the theorem, give some examples, and then give a proof. The proof is a nice illustration of the definition of expected value.


Proposition 1.35 The dominated convergence theorem. Suppose $X_n \to X$ a.s., and there is a random variable $Y$ with $E[Y] < \infty$ such that $|X_n| < Y$ for all $n$. Then

$$E[\lim_{n\to\infty} X_n] = \lim_{n\to\infty} E[X_n].$$
This is often used in the form where $Y$ is a nonrandom constant, and then it’s called the bounded convergence theorem. Before we prove it, we first give a couple of examples and illustrations.

Example 1.36 Suppose $U \sim U(0,1)$ and $X_n = U/n$. It’s easy to see that $X_n \to 0$ a.s., and the theorem would tell us that $E[X_n] \to 0$. In fact in this case we can easily calculate $E[X_n] = \frac{1}{2n} \to 0$. The theorem applies using $Y = 1$ since $|X_n| < 1$.


Example 1.37 With $X \sim N(0,1)$ let $X_n = \min(X, n)$, and notice $X_n \to X$ almost surely. Since $|X_n| \le |X|$ we can apply the theorem using $Y = |X|$ to tell us $E[X_n] \to E[X]$.

Example 1.38 Suppose $X \sim N(0,1)$ and let $X_n = X I_{\{X > -n\}} - n I_{\{X \le -n\}}$. Again $X_n \to X$, so using $Y = |X|$ the theorem tells us $E[X_n] \to E[X]$.


Proof. Proof of the dominated convergence theorem.

To be able to directly apply the definition of expected value, in this proof we assume $X_n \ge 0$. To prove the general result, we can apply the same argument to $X_n + Y \ge 0$ with the bound $|X_n + Y| < 2Y$.

Our approach will be to show that for any $\epsilon > 0$ we have, for all sufficiently large $n$, both (a) $E[X_n] \ge E[X] - 3\epsilon$, and (b) $E[X_n] \le E[X] + 3\epsilon$. Since $\epsilon$ is arbitrary, this will prove the theorem.

First let $N_\epsilon = \min\{n : |X_i - X| < \epsilon \text{ for all } i \ge n\}$, and note that $X_n \to_{as} X$ implies that $P(N_\epsilon < \infty) = 1$. To prove (a), note first that for any $m$

$$X_n + \epsilon \ge \min(X, m) - m I_{\{N_\epsilon > n\}}.$$

The preceding is true when $N_\epsilon > n$ because in this case the right hand side is nonpositive; it is also true when $N_\epsilon \le n$ because in this case $X_n + \epsilon \ge X$. Thus,

$$E[X_n] + \epsilon \ge E[\min(X, m)] - mP(N_\epsilon > n).$$

Now, $|X| \le Y$ implies that $E[X] \le E[Y] < \infty$. Consequently, using the definition of $E[X]$, we can find a simple random variable $Z \le X$ with $E[Z] \ge E[X] - \epsilon$. Since $Z$ is simple, we can then pick $m$ large enough so $Z \le \min(X, m)$, and thus

$$E[\min(X, m)] \ge E[Z] \ge E[X] - \epsilon.$$

Then $N_\epsilon < \infty$ implies, by the continuity property, that $mP(N_\epsilon > n) \le \epsilon$ for sufficiently large $n$. Combining this with the preceding shows that for sufficiently large $n$

$$E[X_n] + \epsilon \ge E[X] - 2\epsilon,$$

which is part (a) above.

For part (b), apply part (a) to the sequence of nonnegative random variables $Y - X_n$, which converges almost surely to $Y - X$ with a bound $|Y - X_n| < 2Y$. We get $E[Y - X_n] \ge E[Y - X] - 3\epsilon$, and re-arranging and subtracting $E[Y]$ from both sides gives (b). ∎


Remark 1.39 Part (a) in the proof holds for non-negative random variables even without the upper bound $Y$ and under the weaker assumption that $\inf_{m > n} X_m \to X$ as $n \to \infty$. This result is usually referred to as Fatou’s lemma. That is, Fatou’s lemma states that for any $\epsilon > 0$ we have $E[X_n] \ge E[X] - \epsilon$ for sufficiently large $n$, or equivalently that $\inf_{m > n} E[X_m] \ge E[X] - \epsilon$ for sufficiently large $n$. This result is usually denoted as $\liminf_{n\to\infty} E[X_n] \ge E[\liminf_{n\to\infty} X_n]$.


A result called the monotone convergence theorem can also be 
proved. 


Proposition 1.40 The monotone convergence theorem. If

$$0 \le X_n \uparrow X$$

then $E[X_n] \uparrow E[X]$.


Proof. If $E[X] < \infty$, we can apply the dominated convergence theorem using the bound $|X_n| \le X$.

Consider now the case where $E[X] = \infty$. For any $m$, we have $\min(X_n, m) \to \min(X, m)$. Because $E[\min(X, m)] < \infty$, it follows by the dominated convergence theorem that

$$\lim_{n\to\infty} E[\min(X_n, m)] = E[\min(X, m)].$$

But since $E[X_n] \ge E[\min(X_n, m)]$, this implies

$$\lim_{n\to\infty} E[X_n] \ge \lim_{m\to\infty} E[\min(X, m)].$$

Because $E[X] = \infty$, it follows that for any $K$ there is a simple random variable $A \le X$ such that $E[A] > K$. Because $A$ is simple, $A \le \min(X, m)$ for sufficiently large $m$. Thus, for any $K$

$$\lim_{m\to\infty} E[\min(X, m)] \ge E[A] > K$$

proving that $\lim_{m\to\infty} E[\min(X, m)] = \infty$, and completing the proof. ∎


We now present a couple of corollaries of the monotone conver- 
gence theorem. 


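Corollary 1.41 If $X_i \ge 0$, then $E[\sum_{i=1}^{\infty} X_i] = \sum_{i=1}^{\infty} E[X_i]$.

Proof.

$$\begin{aligned}
\sum_{i=1}^{\infty} E[X_i] &= \lim_{n} \sum_{i=1}^{n} E[X_i] \\
&= \lim_{n} E\Big[\sum_{i=1}^{n} X_i\Big] \\
&= E\Big[\sum_{i=1}^{\infty} X_i\Big]
\end{aligned}$$

where the final equality follows from the monotone convergence theorem since $\sum_{i=1}^{n} X_i \uparrow \sum_{i=1}^{\infty} X_i$. ∎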


Corollary 1.42 If $X$ and $Y$ are independent, then

$$E[XY] = E[X]E[Y].$$

Proof. Suppose first that $X$ and $Y$ are simple. Then we can write

$$X = \sum_{i=1}^{n} x_i I_{\{X = x_i\}}, \qquad Y = \sum_{j=1}^{m} y_j I_{\{Y = y_j\}}.$$

Thus,

$$\begin{aligned}
E[XY] &= E\Big[\sum_i \sum_j x_i y_j I_{\{X = x_i, Y = y_j\}}\Big] \\
&= \sum_i \sum_j x_i y_j E[I_{\{X = x_i, Y = y_j\}}] \\
&= \sum_i \sum_j x_i y_j P(X = x_i, Y = y_j) \\
&= \sum_i \sum_j x_i y_j P(X = x_i) P(Y = y_j) \\
&= E[X]E[Y].
\end{aligned}$$

Next, suppose $X, Y$ are general non-negative random variables. For any $n$ define the simple random variables

$$X_n = \begin{cases} k/2^n, & \text{if } k/2^n \le X < (k+1)/2^n,\ k = 0, \ldots, n2^n - 1 \\ n, & \text{if } X \ge n \end{cases}$$

Define random variables $Y_n$ in a similar fashion, and note that

$$X_n \uparrow X, \quad Y_n \uparrow Y, \quad X_n Y_n \uparrow XY.$$

Hence, by the monotone convergence theorem,

$$E[X_n Y_n] \to E[XY].$$

But $X_n$ and $Y_n$ are simple, and so

$$E[X_n Y_n] = E[X_n]E[Y_n] \to E[X]E[Y],$$

with the convergence again following by the monotone convergence theorem. Thus, $E[XY] = E[X]E[Y]$ when $X$ and $Y$ are nonnegative. The general case follows by writing $X = X^+ - X^-$, $Y = Y^+ - Y^-$, using that

$$E[XY] = E[X^+Y^+] - E[X^+Y^-] - E[X^-Y^+] + E[X^-Y^-]$$

and applying the result to each of the four preceding expectations. ∎


1.8 Convergence in Probability and in Distribution


In this section we introduce two forms of convergence that are weaker 
than almost sure convergence. However, before giving their defini- 


tions, we will start with a useful result, known as the Borel-Cantelli 
lemma. 


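Proposition 1.43 If $\sum_{i=1}^{\infty} P(A_i) < \infty$, then $P(\limsup A_k) = 0$.

Proof: Suppose $\sum_{i=1}^{\infty} P(A_i) < \infty$. Now,

$$P(\limsup A_k) = P\Big(\bigcap_{n=1}^{\infty} \bigcup_{i=n}^{\infty} A_i\Big).$$

Hence, for any $n$

$$P(\limsup A_k) \le P\Big(\bigcup_{i=n}^{\infty} A_i\Big) \le \sum_{i=n}^{\infty} P(A_i)$$

and the result follows by letting $n \to \infty$. ∎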

Remark: Because $\sum_n I_{A_n}$ is the number of the events $A_n, n \ge 1$, that occur, the Borel-Cantelli lemma states that if the expected number of the events $A_n, n \ge 1$, that occur is finite, then the probability that an infinite number of them occur is 0. Thus the Borel-Cantelli lemma is equivalent to the rather intuitive result that if there is a positive probability that an infinite number of the events $A_n$ occur, then the expected number of them that occur is infinite.


The converse of the Borel-Cantelli lemma requires that the indi- 
cator variables for each pair of events be negatively correlated. 


Proposition 1.44 Let the events $A_i, i \ge 1$, be such that

$$\mathrm{Cov}(I_{A_i}, I_{A_j}) = E[I_{A_i} I_{A_j}] - E[I_{A_i}]E[I_{A_j}] \le 0, \quad i \ne j.$$

If $\sum_{i=1}^{\infty} P(A_i) = \infty$, then $P(\limsup A_i) = 1$.


Proof. Let $N_n = \sum_{i=1}^{n} I_{A_i}$ be the number of the events $A_1, \ldots, A_n$ that occur, and let $N = \sum_{i=1}^{\infty} I_{A_i}$ be the total number of events that occur. Let $m_n = E[N_n] = \sum_{i=1}^{n} P(A_i)$, and note that $\lim_n m_n = \infty$. Using the formula for the variance of a sum of random variables learned in your first course in probability, we have

$$\begin{aligned}
\mathrm{Var}(N_n) &= \sum_{i=1}^{n} \mathrm{Var}(I_{A_i}) + 2 \sum_{i<j} \mathrm{Cov}(I_{A_i}, I_{A_j}) \\
&\le \sum_{i=1}^{n} \mathrm{Var}(I_{A_i}) \\
&= \sum_{i=1}^{n} P(A_i)[1 - P(A_i)] \\
&\le m_n.
\end{aligned}$$

Now, by Chebyshev’s inequality, for any $x < m_n$

$$\begin{aligned}
P(N_n < x) &= P(m_n - N_n > m_n - x) \\
&\le P(|N_n - m_n| > m_n - x) \\
&\le \frac{\mathrm{Var}(N_n)}{(m_n - x)^2} \\
&\le \frac{m_n}{(m_n - x)^2}.
\end{aligned}$$

Hence, for any $x$, $\lim_{n\to\infty} P(N_n < x) = 0$. Because $P(N < x) \le P(N_n < x)$, this implies that

$$P(N < x) = 0.$$

Consequently, by the continuity property of probabilities,

$$\begin{aligned}
0 &= \lim_{k\to\infty} P(N < k) \\
&= P(\lim_k \{N < k\}) \\
&= P\Big(\bigcup_k \{N < k\}\Big) \\
&= P(N < \infty).
\end{aligned}$$

Hence, with probability 1, an infinite number of the events $A_i$ occur. ∎
Example 1.45 Consider independent flips of a coin that lands heads with probability $p > 0$. For fixed $k$, let $B_n$ be the event that flips $n, n+1, \ldots, n+k-1$ all land heads. Because the events $B_n, n \ge 1$, are positively correlated, we cannot directly apply the converse to the Borel-Cantelli lemma to obtain that, with probability 1, an infinite number of them occur. However, by letting $A_n$ be the event that flips $nk+1, \ldots, nk+k$ all land heads, then because the sets of flips these events refer to are nonoverlapping, it follows that they are independent. Because $\sum_n P(A_n) = \sum_n p^k = \infty$, we obtain from the converse to the Borel-Cantelli lemma that $P(\limsup A_n) = 1$. But $\limsup A_n \subset \limsup B_n$, so the preceding yields the result $P(\limsup B_n) = 1$. ∎


Remark 1.46 The converse of the Borel-Cantelli lemma is usually stated as requiring the events $A_i, i \ge 1$, to be independent. Our weakening of this condition can be quite useful, as is indicated by our next example.


Example 1.47 Consider an infinite collection of balls that are numbered $0, 1, \ldots$, and an infinite collection of boxes also numbered $0, 1, \ldots$. Suppose that ball $i$, $i \ge 0$, is to be put in box $i + X_i$ where $X_i, i \ge 0$, are iid with probability mass function

$$P(X_i = j) = p_j, \qquad \sum_{j \ge 0} p_j = 1.$$

Suppose also that the $X_i$ are not deterministic, so that $p_j < 1$ for all $j \ge 0$. If $A_j$ denotes the event that box $j$ remains empty, then

$$\begin{aligned}
P(A_j) &= P(X_j \ne 0, X_{j-1} \ne 1, \ldots, X_0 \ne j) \\
&= P(X_0 \ne 0, X_1 \ne 1, \ldots, X_j \ne j) \\
&\ge P(X_i \ne i, \text{ for all } i \ge 0).
\end{aligned}$$

But

$$\begin{aligned}
P(X_i \ne i, \text{ for all } i \ge 0) &= 1 - P\Big(\bigcup_{i \ge 0} \{X_i = i\}\Big) \\
&= 1 - p_0 - \sum_{i \ge 1} P(X_0 \ne 0, \ldots, X_{i-1} \ne i-1, X_i = i) \\
&= 1 - p_0 - \sum_{i \ge 1} p_i \prod_{j=0}^{i-1} (1 - p_j).
\end{aligned}$$

Now, there is at least one pair $k < i$ such that $p_i p_k \equiv p > 0$. Hence, for that pair

$$p_i \prod_{j=0}^{i-1} (1 - p_j) \le p_i (1 - p_k) = p_i - p$$

implying that

$$P(A_j) \ge P(X_i \ne i, \text{ for all } i \ge 0) \ge p > 0.$$

Hence, $\sum_j P(A_j) = \infty$. Conditional on box $j$ being empty, each ball becomes more likely to be put in box $i$, $i \ne j$, so, for $i < j$,

$$\begin{aligned}
P(A_i \mid A_j) &= \prod_{k=0}^{i} P(X_k \ne i - k \mid A_j) \\
&= \prod_{k=0}^{i} P(X_k \ne i - k \mid X_k \ne j - k) \\
&\le \prod_{k=0}^{i} P(X_k \ne i - k) \\
&= P(A_i)
\end{aligned}$$

which is equivalent to $\mathrm{Cov}(I_{A_i}, I_{A_j}) \le 0$. Hence, by the converse of the Borel-Cantelli lemma we can conclude that, with probability 1, there will be an infinite number of empty boxes.
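The ball-and-box scheme is simple to simulate. The sketch below is an illustration under an assumed pmf, $p_0 = p_1 = 1/2$ (not taken from the text), and the function names are hypothetical. For that pmf, box $j \ge 1$ is empty exactly when $X_j = 1$ and $X_{j-1} = 0$, so about a quarter of the boxes should be empty.

```python
import random


def empty_boxes(shifts):
    """Ball i goes to box i + shifts[i]; return the empty boxes among
    0 .. len(shifts) - 1.

    Box j can only receive balls 0..j, so these boxes are fully determined
    by the first len(shifts) balls.
    """
    occupied = {i + x for i, x in enumerate(shifts)}
    return [j for j in range(len(shifts)) if j not in occupied]


def simulate_empty_fraction(num_balls: int = 100_000, seed: int = 0) -> float:
    """Fraction of empty boxes when P(X_i = 0) = P(X_i = 1) = 1/2 (assumed pmf)."""
    rng = random.Random(seed)
    shifts = [rng.randrange(2) for _ in range(num_balls)]
    return len(empty_boxes(shifts)) / num_balls


if __name__ == "__main__":
    print(empty_boxes([0, 1, 0, 1]))
    print(simulate_empty_fraction())
```

The count of empty boxes keeps growing with the number of balls, in line with the conclusion that infinitely many boxes stay empty.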


We say that the sequence of random variables $X_n, n \ge 1$, converges in probability to the random variable $X$, written $X_n \to_p X$, if for any $\epsilon > 0$

$$P(|X_n - X| > \epsilon) \to 0 \quad \text{as } n \to \infty.$$

An immediate corollary of Proposition 1.32 is that almost sure convergence implies convergence in probability. The following example shows that the converse is not true.


Example 1.48 Let $X_n, n \ge 1$, be independent random variables such that

$$P(X_n = 1) = 1/n = 1 - P(X_n = 0).$$

For any $\epsilon > 0$, $P(|X_n| > \epsilon) \le 1/n \to 0$; hence, $X_n \to_p 0$. However, because $\sum_{n=1}^{\infty} P(X_n = 1) = \infty$, it follows from the converse to the Borel-Cantelli lemma that, with probability 1, $X_n = 1$ for infinitely many values of $n$, showing that the sequence does not converge almost surely to 0.
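A short simulation makes the contrast visible: the ones become rare, yet they never stop appearing, because their expected count up to $N$ is the harmonic sum, which grows like $\log N$. The sketch below is illustrative only; the function names and the seed are assumptions.

```python
import math
import random


def harmonic(N: int) -> float:
    """Expected number of ones among X_1..X_N, i.e. the sum of 1/n for n <= N."""
    return sum(1.0 / n for n in range(1, N + 1))


def ones_positions(N: int, seed: int = 0):
    """Indices n <= N with X_n = 1, where P(X_n = 1) = 1/n independently."""
    rng = random.Random(seed)
    return [n for n in range(1, N + 1) if rng.random() < 1.0 / n]


if __name__ == "__main__":
    for N in (10**2, 10**4, 10**6):
        print(N, len(ones_positions(N)), round(harmonic(N), 2), round(math.log(N), 2))
```

Since $P(X_1 = 1) = 1$, the first index always appears in the list, and the unbounded growth of `harmonic(N)` is the divergence that the converse to the Borel-Cantelli lemma needs.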


Let $F_n$ be the distribution function of $X_n$, and let $F$ be the distribution function of $X$. We say that $X_n$ converges in distribution to $X$ if

$$\lim_{n\to\infty} F_n(x) = F(x)$$

for all $x$ at which $F$ is continuous. (That is, convergence is required at all $x$ for which $P(X = x) = 0$.)

To understand why convergence in distribution only requires that $F_n(x) \to F(x)$ at points of continuity of $F$, rather than at all values $x$, let $X_n$ be uniformly distributed on $(0, 1/n)$. Then, it seems reasonable to suppose that $X_n$ converges in distribution to the random variable $X$ that is identically 0. However,

$$F_n(x) = \begin{cases} 0, & \text{if } x < 0 \\ nx, & \text{if } 0 \le x \le 1/n \\ 1, & \text{if } x > 1/n \end{cases}$$

while the distribution function of $X$ is

$$F(x) = \begin{cases} 0, & \text{if } x < 0 \\ 1, & \text{if } x \ge 0 \end{cases}$$

and so $\lim_n F_n(0) = 0 \ne F(0) = 1$. On the other hand, for all points of continuity of $F$, that is, for all $x \ne 0$, we have that $\lim_n F_n(x) = F(x)$, and so with the definition given it is indeed true that $X_n \to_d X$.
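The two distribution functions in this discussion can be coded directly, which makes the failure at $x = 0$ and the agreement elsewhere easy to inspect. This is a minimal sketch; the function names `F_n` and `F` simply mirror the notation above.

```python
def F_n(x: float, n: int) -> float:
    """Distribution function of a uniform (0, 1/n) random variable."""
    if x < 0:
        return 0.0
    if x <= 1.0 / n:
        return n * x
    return 1.0


def F(x: float) -> float:
    """Distribution function of the random variable that is identically 0."""
    return 1.0 if x >= 0 else 0.0


if __name__ == "__main__":
    for x in (-0.01, 0.0, 0.01):
        print(x, [F_n(x, n) for n in (1, 10, 100, 1000)], F(x))
```

At $x = 0$ every $F_n$ equals 0 while $F(0) = 1$; at any other fixed $x$ the values of $F_n(x)$ agree with $F(x)$ once $n$ is large enough.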


We now show that convergence in probability implies convergence 
in distribution. 


Proposition 1.49

$$X_n \to_p X \implies X_n \to_d X$$

Proof: Suppose that $X_n \to_p X$. Let $F_n$ be the distribution function of $X_n, n \ge 1$, and let $F$ be the distribution function of $X$. Now, for any $\epsilon > 0$

$$\begin{aligned}
F_n(x) &= P(X_n \le x, X \le x + \epsilon) + P(X_n \le x, X > x + \epsilon) \\
&\le F(x + \epsilon) + P(|X_n - X| > \epsilon)
\end{aligned}$$

where the preceding used that

$$X_n \le x,\ X > x + \epsilon \implies |X_n - X| > \epsilon.$$

Letting $n$ go to infinity yields, upon using that $X_n \to_p X$,

$$\limsup_{n\to\infty} F_n(x) \le F(x + \epsilon) \qquad (1.4)$$

Similarly,

$$\begin{aligned}
F(x - \epsilon) &= P(X \le x - \epsilon, X_n \le x) + P(X \le x - \epsilon, X_n > x) \\
&\le F_n(x) + P(|X_n - X| > \epsilon).
\end{aligned}$$

Letting $n \to \infty$ gives

$$F(x - \epsilon) \le \liminf_{n\to\infty} F_n(x) \qquad (1.5)$$

Combining equations (1.4) and (1.5) shows that

$$F(x - \epsilon) \le \liminf_{n\to\infty} F_n(x) \le \limsup_{n\to\infty} F_n(x) \le F(x + \epsilon).$$

Letting $\epsilon \to 0$ shows that if $x$ is a continuity point of $F$ then

$$F(x) \le \liminf_{n\to\infty} F_n(x) \le \limsup_{n\to\infty} F_n(x) \le F(x)$$

and the result is proved. ∎


Proposition 1.50 If $X_n \to_d X$, then

$$E[g(X_n)] \to E[g(X)]$$

for any bounded continuous function $g$.

To focus on the essentials, we will present a proof of Proposition 1.50 when all the random variables $X_n$ and $X$ are continuous. Before doing so, we will prove a couple of lemmas.


Lemma 1.51 Let $G$ be the distribution function of a continuous random variable, and let $G^{-1}(x) \equiv \inf\{t : G(t) \ge x\}$ be its inverse function. If $U$ is a uniform (0,1) random variable, then $G^{-1}(U)$ has distribution function $G$.

Proof. Since

$$\inf\{t : G(t) \ge U\} \le x \iff G(x) \ge U$$

implies

$$P(G^{-1}(U) \le x) = P(G(x) \ge U) = G(x),$$

we get the result. ∎
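Lemma 1.51 is the basis of inverse transform sampling. The sketch below illustrates it for one assumed choice of $G$, the exponential distribution with rate 1, where $G(t) = 1 - e^{-t}$ and $G^{-1}(u) = -\log(1 - u)$; the function names, sample size, and seed are illustrative assumptions.

```python
import math
import random


def G(t: float) -> float:
    """Distribution function of an exponential random variable with rate 1."""
    return 1.0 - math.exp(-t) if t > 0 else 0.0


def G_inv(u: float) -> float:
    """Inverse of G on [0, 1): the smallest t with G(t) >= u."""
    return -math.log(1.0 - u)


def empirical_cdf_at(x: float, num_samples: int = 100_000, seed: int = 0) -> float:
    """Fraction of samples G_inv(U) that are <= x, with U uniform on [0, 1)."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(num_samples) if G_inv(rng.random()) <= x)
    return hits / num_samples


if __name__ == "__main__":
    for x in (0.5, 1.0, 2.0):
        print(x, round(empirical_cdf_at(x), 4), round(G(x), 4))
```

The empirical fractions should track $G(x)$ closely, which is the content of the lemma for this particular $G$.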


Lemma 1.52 Let $X_n \to_d X$, where $X_n$ is continuous with distribution function $F_n$, $n \ge 1$, and $X$ is continuous with distribution function $F$. If

$$F_n(x_n) \to F(x), \quad \text{where } 0 < F(x) < 1,$$

then $x_n \to x$.

Proof. Suppose there is an $\epsilon > 0$ such that $x_n \le x - \epsilon$ for infinitely many $n$. If so, then $F_n(x_n) \le F_n(x - \epsilon)$ for infinitely many $n$, implying that

$$F(x) = \liminf_n F_n(x_n) \le \lim_n F_n(x - \epsilon) = F(x - \epsilon)$$

which is a contradiction. We arrive at a similar contradiction upon assuming there is an $\epsilon > 0$ such that $x_n \ge x + \epsilon$ for infinitely many $n$. Consequently, we can conclude that for any $\epsilon > 0$, $|x_n - x| > \epsilon$ for only a finite number of $n$, thus proving the lemma. ∎


Proof of Proposition 1.50: Let $U$ be a uniform (0,1) random variable, and set $Y_n = F_n^{-1}(U), n \ge 1$, and $Y = F^{-1}(U)$. Note that from Lemma 1.51 it follows that $Y_n$ has distribution $F_n$, and $Y$ has distribution $F$. Because

$$F_n(F_n^{-1}(u)) = u = F(F^{-1}(u))$$

it follows from Lemma 1.52 that $F_n^{-1}(u) \to F^{-1}(u)$ for all $u$. Thus, $Y_n \to_{as} Y$. By continuity, this implies that $g(Y_n) \to_{as} g(Y)$, and, because $g$ is bounded, the dominated convergence theorem yields that $E[g(Y_n)] \to E[g(Y)]$. But $X_n$ and $Y_n$ both have distribution $F_n$, while $X$ and $Y$ both have distribution $F$, and so $E[g(Y_n)] = E[g(X_n)]$ and $E[g(Y)] = E[g(X)]$. ∎


Remark 1.53 The key to our proof of Proposition 1.50 was showing that if $X_n \to_d X$, then we can define random variables $Y_n, n \ge 1$, and $Y$ such that $Y_n$ has the same distribution as $X_n$ for each $n$, and $Y$ has the same distribution as $X$, and are such that $Y_n \to_{as} Y$. This result (which is true without the continuity assumptions we made) is known as Skorokhod’s representation theorem.


Skorokhod’s representation and the dominated convergence the- 
orem immediately yield the following. 


Corollary 1.54 If $X_n \to_d X$ and there exists a constant $M < \infty$ such that $|X_n| < M$ for all $n$, then

$$\lim_{n\to\infty} E[X_n] = E[X].$$

Proof. Let $F_n$ denote the distribution of $X_n$, $n \ge 1$, and $F$ that of $X$. Let $U$ be a uniform (0,1) random variable and, for $n \ge 1$, set $Y_n = F_n^{-1}(U)$, and $Y = F^{-1}(U)$. Note that the hypotheses of the corollary imply that $Y_n \to_{as} Y$, and, because $F_n(M) = 1 = 1 - F_n(-M)$, also that $|Y_n| \le M$. Thus, by the dominated convergence theorem

$$E[Y_n] \to E[Y]$$

which proves the result because $Y_n$ has distribution $F_n$, and $Y$ has distribution $F$. ∎


Proposition 1.50 can also be used to give a simple proof of Weierstrass’ approximation theorem.

Corollary 1.55 Weierstrass’ Approximation Theorem Any continuous function $f$ defined on the interval $[0,1]$ can be expressed as a limit of polynomial functions. Specifically, if

$$B_n(t) = \sum_{i=0}^{n} f(i/n) \binom{n}{i} t^i (1-t)^{n-i}$$

then $f(t) = \lim_{n\to\infty} B_n(t)$.

Proof. Let $X_i, i \ge 1$, be a sequence of iid random variables such that

$$P(X_i = 1) = t = 1 - P(X_i = 0).$$

Because $E\big[\frac{X_1 + \cdots + X_n}{n}\big] = t$, it follows from Chebyshev’s inequality that for any $\epsilon > 0$

$$P\Big(\Big|\frac{X_1 + \cdots + X_n}{n} - t\Big| > \epsilon\Big) \le \frac{\mathrm{Var}\big([X_1 + \cdots + X_n]/n\big)}{\epsilon^2} = \frac{t(1-t)}{n\epsilon^2}.$$

Thus, $\frac{X_1 + \cdots + X_n}{n} \to_p t$, implying that $\frac{X_1 + \cdots + X_n}{n} \to_d t$. Because $f$ is a continuous function on a closed interval, it is bounded and so Proposition 1.50 yields that

$$E\Big[f\Big(\frac{X_1 + \cdots + X_n}{n}\Big)\Big] \to f(t).$$

But $X_1 + \cdots + X_n$ is a binomial $(n, t)$ random variable; thus,

$$E\Big[f\Big(\frac{X_1 + \cdots + X_n}{n}\Big)\Big] = B_n(t)$$

and the proof is complete. ∎
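The polynomials $B_n$ (commonly called Bernstein polynomials) are straightforward to evaluate, so the convergence $B_n(t) \to f(t)$ can be observed directly. The sketch below is illustrative; the function name `bernstein` and the test functions are assumptions chosen for the example.

```python
import math


def bernstein(f, n: int, t: float) -> float:
    """Evaluate B_n(t) = sum_i f(i/n) * C(n, i) * t^i * (1 - t)^(n - i),
    which is E[f(S/n)] for S binomial (n, t)."""
    return sum(
        f(i / n) * math.comb(n, i) * t**i * (1 - t) ** (n - i)
        for i in range(n + 1)
    )


if __name__ == "__main__":
    # A continuous function with a kink, so the convergence is visibly slow.
    f = lambda x: abs(x - 0.5)
    for n in (10, 100, 400):
        print(n, bernstein(f, n, 0.5), f(0.5))
```

For $f(x) = x^2$ the expectation can be computed by hand, $B_n(t) = t^2 + t(1-t)/n$, which gives an exact check on the code and shows the $1/n$ rate that the Chebyshev bound in the proof suggests.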
Definition 1.56 For a sequence of random variables $X_1, X_2, \ldots$ the tail sigma field $\mathcal{T}$ is defined as

$$\mathcal{T} = \bigcap_{n=1}^{\infty} \sigma(X_n, X_{n+1}, \ldots).$$

Events $A \in \mathcal{T}$ are called tail events.

Though it may seem as though there are no events remaining in the above intersection, there are lots of examples of very interesting tail events. Intuitively, with a tail event you can ignore any finite number of the variables and still be able to tell whether or not the event occurs. Next are some examples.


Example 1.57 Consider a sequence of random variables $X_1, X_2, \ldots$ having tail sigma field $\mathcal{T}$ and satisfying $|X_i| < \infty$ for all $i$. For the event $A_x = \{\lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^{n} X_i = x\}$, it’s easy to see that $A_x \in \mathcal{T}$ because to determine if $A_x$ happens you can ignore any finite number of the random variables; their contributions end up becoming negligible in the limit.

For the event $B_x = \{\sup_i X_i = x\}$, it’s easy to see that $B_\infty \in \mathcal{T}$ because it depends on the long run behavior of the sequence. Note that $B_7 \notin \mathcal{T}$ because it depends, for example, on whether or not $X_1 \le 7$.


Example 1.58 Consider a sequence of random variables $X_1, X_2, \ldots$ having tail sigma field $\mathcal{T}$, but this time let it be possible for $X_i = \infty$ for some $i$. For the event $A_x = \{\lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^{n} X_i = x\}$ we now have $A_x \notin \mathcal{T}$, because any variable along the way which equals infinity will affect the limit.


Remark 1.59 The previous two examples also motivate the subtle difference between $X_i < \infty$ and $X_i < \infty$ almost surely. The former means it’s impossible to see $X_i = \infty$ and the latter only says it has probability zero. An event which has probability zero could still be a possible occurrence. For example if $X$ is a uniform random variable between zero and one, we can write $X \ne .2$ almost surely even though it is possible to see $X = .2$.


One approach for proving an event always happens is to first prove that its probability must either be zero or one, and then rule out zero as a possibility. This first type of result is called a zero-one law, since we are proving the chance must either be zero or one. A nice way to do this is to show an event $A$ is independent of itself, and hence $P(A) = P(A \cap A) = P(A)P(A)$, and thus $P(A) = 0$ or 1. We use this approach next to prove a famous zero-one law for independent random variables, and we will use this in our proof of the law of large numbers.

First we need the following definition. Events with probability either zero or one are called trivial events, and a sigma field is called trivial if every event in it is trivial.


Theorem 1.60 Kolmogorov’s Zero-One Law. A sequence of independent random variables has a trivial tail sigma field.

Before we give a proof we need the following result. To show that a random variable $Y$ is independent of an infinite sequence of random variables $X_1, X_2, \ldots$, it suffices to show that $Y$ is independent of $X_1, X_2, \ldots, X_n$ for every finite $n < \infty$. In elementary courses this result is often given as a definition, but it can be justified using measure theory in the next proposition. We define $\sigma(X_i, i \in A) \equiv \sigma(\cup_{i \in A} \sigma(X_i))$ to be the smallest sigma field generated by the collection of random variables $X_i, i \in A$.

Proposition 1.61 Consider the random variables $Y$ and $X_1, X_2, \ldots$ where $\sigma(Y)$ is independent of $\sigma(X_1, X_2, \ldots, X_n)$ for every $n < \infty$. Then $\sigma(Y)$ is independent of $\sigma(X_1, X_2, \ldots)$.


Before we prove this proposition, we show how this implies Kolmogorov’s Zero-One Law.

Proof. Proof of Kolmogorov’s Zero-One Law. We will argue that any event $A \in \mathcal{T}$ is independent of itself, and thus $P(A) = P(A \cap A) = P(A)P(A)$ and so $P(A) = 0$ or 1. Note that the tail sigma field $\mathcal{T}$ is independent of $\sigma(X_1, X_2, \ldots, X_n)$ for every $n < \infty$ (since $\mathcal{T} \subseteq \sigma(X_{n+1}, X_{n+2}, \ldots)$), and so by the previous proposition is also independent of $\sigma(X_1, X_2, \ldots)$ and thus, since $\mathcal{T} \subseteq \sigma(X_1, X_2, \ldots)$, also is independent of $\mathcal{T}$. ∎


Now we prove the proposition. 


Proof. Proof of Proposition 1.61. Pick any $A \in \sigma(Y)$. You might at first think that $\mathcal{H} \equiv \cup_{n=1}^{\infty} \sigma(X_1, X_2, \ldots, X_n)$ is the same as $\mathcal{F} \equiv \sigma(X_1, X_2, \ldots)$, and then the theorem would follow immediately since by assumption $A$ is independent of any event in $\mathcal{H}$. But it is not true that $\mathcal{H}$ and $\mathcal{F}$ are the same: $\mathcal{H}$ may not even be a sigma field. Also, the tail sigma field $\mathcal{T}$ is a subset of $\mathcal{F}$ but not necessarily of $\mathcal{H}$. It is, however, true that $\mathcal{F} \subseteq \sigma(\mathcal{H})$ (in fact it turns out that $\sigma(\mathcal{H}) = \mathcal{F}$) which follows because $\sigma(X_1, X_2, \ldots) \equiv \sigma(\cup_{n=1}^{\infty} \sigma(X_n))$ and $\cup_{n=1}^{\infty} \sigma(X_n) \subseteq \mathcal{H}$. We will use $\mathcal{F} \subseteq \sigma(\mathcal{H})$ later below.

Define the collection of events $\mathcal{G}$ to contain any $B \in \mathcal{F}$ where for every $\epsilon > 0$ we can find a corresponding “approximating” event $C \in \mathcal{H}$ where $P(B \cap C^c) + P(B^c \cap C) \le \epsilon$. Since $A$ is independent of any event $C \in \mathcal{H}$, we can see that $A$ must also be independent of any event $B \in \mathcal{G}$ because, using the corresponding “approximating” event $C$ for any desired $\epsilon > 0$,


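$$\begin{aligned}
P(A \cap B) &= P(A \cap B \cap C) + P(A \cap B \cap C^c) \\
&\le P(A \cap C) + P(B \cap C^c) \\
&\le P(A)P(C) + \epsilon \\
&= P(A)\big(P(C \cap B) + P(C \cap B^c)\big) + \epsilon \\
&\le P(A)P(B) + 2\epsilon
\end{aligned}$$

and

$$\begin{aligned}
1 - P(A \cap B) &= P(A^c \cup B^c) \\
&= P(A^c) + P(A \cap B^c) \\
&= P(A^c) + P(A \cap B^c \cap C) + P(A \cap B^c \cap C^c) \\
&\le P(A^c) + P(B^c \cap C) + P(A \cap C^c) \\
&\le P(A^c) + \epsilon + P(A)P(C^c) \\
&= P(A^c) + \epsilon + P(A)\big(P(C^c \cap B) + P(C^c \cap B^c)\big) \\
&\le P(A^c) + 2\epsilon + P(A)P(B^c) \\
&= 1 + 2\epsilon - P(A)P(B),
\end{aligned}$$

which when combined gives

$$P(A)P(B) - 2\epsilon \le P(A \cap B) \le P(A)P(B) + 2\epsilon.$$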


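Since $\epsilon$ is arbitrary, this shows $\sigma(Y)$ is independent of $\mathcal{G}$. We obtain the proposition by showing $\mathcal{F} \subseteq \sigma(\mathcal{H}) \subseteq \mathcal{G}$, and thus that $\sigma(Y)$ is independent of $\mathcal{F}$, as follows. First note we immediately have $\mathcal{H} \subseteq \mathcal{G}$, and thus $\sigma(\mathcal{H}) \subseteq \sigma(\mathcal{G})$, and we will be finished if we can show $\sigma(\mathcal{G}) = \mathcal{G}$.

To show that $\mathcal{G}$ is a sigma field, clearly $\Omega \in \mathcal{G}$ and $B^c \in \mathcal{G}$ whenever $B \in \mathcal{G}$. Next let $B_1, B_2, \ldots$ be events in $\mathcal{G}$. To show that $\cup_{i=1}^{\infty} B_i \in \mathcal{G}$, pick any $\epsilon > 0$ and let $C_i$ be the corresponding “approximating” events that satisfy $P(B_i \cap C_i^c) + P(B_i^c \cap C_i) < \epsilon/2^{i+1}$. Then pick $n$ so that

$$\sum_{i > n} P(B_i \cap B_{i-1}^c \cap B_{i-2}^c \cap \cdots \cap B_1^c) < \epsilon/2$$

and below we use the approximating event $C \equiv \cup_{i=1}^{n} C_i \in \mathcal{H}$ to get

$$\begin{aligned}
P(\cup_i B_i \cap C^c) + P((\cup_i B_i)^c \cap C) &\le P(\cup_{i=1}^{n} B_i \cap C^c) + \epsilon/2 + P((\cup_{i=1}^{n} B_i)^c \cap C) \\
&\le \sum_i \big[P(B_i \cap C_i^c) + P(B_i^c \cap C_i)\big] + \epsilon/2 \\
&\le \sum_i \epsilon/2^{i+1} + \epsilon/2 \\
&= \epsilon,
\end{aligned}$$

and thus $\cup_{i=1}^{\infty} B_i \in \mathcal{G}$. ∎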
A more powerful theorem, called the extension theorem, can be used to prove Kolmogorov’s Zero-One Law. We state it without proof.

Theorem 1.62 The extension theorem. Suppose you have random variables $X_1, X_2, \ldots$ and you consistently define probabilities for all events in $\sigma(X_1, X_2, \ldots, X_n)$ for every $n$. This implies a unique value of the probability of any event in $\sigma(X_1, X_2, \ldots)$.

Remark 1.63 To see how this implies Kolmogorov’s Zero-One Law, specify probabilities under the assumption that $A$ is independent of any event $B \in \cup_{n=1}^{\infty} \mathcal{F}_n$. The extension theorem will say that $A$ is independent of $\sigma(\cup_{n=1}^{\infty} \mathcal{F}_n)$.


We will prove the law of large numbers using the more powerful ergodic theorem. This means we will show that the long-run average for a sequence of random variables converges to the expected value under more general conditions than just for independent random variables. We will define these more general conditions next.

Given a sequence of random variables $X_1, X_2, \ldots$, suppose (for simplicity and without loss of generality) that there is a one-to-one correspondence between events of the form $\{X_1 = x_1, X_2 = x_2, X_3 = x_3, \ldots\}$ and elements of the sample space $\Omega$. An event $A$ is called an invariant event if the occurrence of

$$\{X_1 = x_1, X_2 = x_2, X_3 = x_3, \ldots\} \in A$$

implies both

$$\{X_1 = x_2, X_2 = x_3, X_3 = x_4, \ldots\} \in A$$

and

$$\{X_1 = x_0, X_2 = x_1, X_3 = x_2, \ldots\} \in A.$$

In other words, an invariant event is not affected by shifting the sequence of random variables to the left or right. For example, $A = \{\sup_{n \ge 1} X_n = \infty\}$ is an invariant event if $X_n < \infty$ for all $n$ because $\sup_{n \ge 1} X_n = \infty$ implies both $\sup_{n \ge 1} X_{n+1} = \infty$ and $\sup_{n \ge 1} X_{n-1} = \infty$.

On the other hand, the event $A = \{\lim_n X_{2n} = 0\}$ is not invariant because if a sequence $x_2, x_4, \ldots$ converges to 0 it doesn’t necessarily mean that $x_1, x_3, \ldots$ converges to 0. Consider the example where $P(X_1 = 1) = 1/2 = 1 - P(X_1 = 0)$ and $X_n = 1 - X_{n-1}$ for $n > 1$. In this case either $X_{2n} = 0$ and $X_{2n-1} = 1$ for all $n \ge 1$ or $X_{2n} = 1$ and $X_{2n-1} = 0$ for all $n \ge 1$, so $A = \{\lim_n X_{2n} = 0\}$ and $\{\lim_n X_{2n-1} = 0\}$ cannot occur together.

It can be shown (see Problem 21 below) that the set of invariant
events makes up a sigma field, called the invariant sigma
field, and is a subset of the tail sigma field. A sequence of random
variables $X_1, X_2, \ldots$ is called ergodic if it has a trivial invariant
sigma field, and is called stationary if the random variables
$(X_1, X_2, \ldots, X_n)$ have the same joint distribution as the random variables
$(X_k, X_{k+1}, \ldots, X_{n+k-1})$ for every $n, k$.

We are now ready to state the ergodic theorem, and an immediate 
corollary will be the strong law of large numbers. 


Theorem 1.64 The ergodic theorem. If the sequence $X_1, X_2, \ldots$ is
stationary and ergodic with $E|X_1| < \infty$, then $\frac{1}{n} \sum_{i=1}^{n} X_i \to E[X_1]$
almost surely.


Since a sequence of independent and identically distributed random
variables is clearly stationary and, by Kolmogorov’s zero-one
law, ergodic, we get the strong law of large numbers as an immediate
corollary.

Corollary 1.65 The strong law of large numbers. If $X_1, X_2, \ldots$ are iid
with $E|X_1| < \infty$, then $\frac{1}{n} \sum_{i=1}^{n} X_i \to E[X_1]$ almost surely.


Proof of the ergodic theorem. Given $\epsilon > 0$ let $Y_i = X_i - E[X_1] - \epsilon$
and $M_n = \max(0, Y_1, Y_1 + Y_2, \ldots, Y_1 + Y_2 + \cdots + Y_n)$. Since
$\frac{1}{n} \sum_{i=1}^{n} Y_i \le \frac{1}{n} M_n$, we will first show that $M_n/n \to 0$ almost surely,
and then the theorem will follow after repeating the whole argument
applied instead to $Y_i = -X_i + E[X_1] - \epsilon$.

Letting $M'_n = \max(0, Y_2, Y_2 + Y_3, \ldots, Y_2 + Y_3 + \cdots + Y_{n+1})$ and
using stationarity in the last equality below, we have

$$\begin{aligned}
E[M_{n+1}] &= E[\max(0, Y_1 + M'_n)] \\
&= E[M'_n + \max(-M'_n, Y_1)] \\
&= E[M_n] + E[\max(-M'_n, Y_1)]
\end{aligned}$$

and, since $M_n \le M_{n+1}$ implies $E[M_n] \le E[M_{n+1}]$, we can conclude
$E[\max(-M'_n, Y_1)] \ge 0$ for all $n$.

Since $\{M_n/n \to 0\}$ is an invariant event, by ergodicity it must
have probability either 0 or 1. If we were to assume the probability
is 0, then $M_{n+1} \ge M_n$ would imply $M_n \to \infty$ and also
$M'_n \to \infty$, and thus $\max(-M'_n, Y_1) \to Y_1$. The dominated convergence
theorem using the bound $|\max(-M'_n, Y_1)| \le |Y_1|$ would then
give $E[\max(-M'_n, Y_1)] \to E[Y_1] = -\epsilon$, which would then contradict
the previous conclusion that $E[\max(-M'_n, Y_1)] \ge 0$ for all $n$. This
contradiction means we must have $M_n/n \to 0$ almost surely, and the
theorem is proved. ∎
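The ergodic theorem covers dependent sequences, which a small simulation can illustrate. The sketch below is not from the text: it assumes a hypothetical two-state Markov chain on {0, 1} started from its stationary distribution (a stationary, ergodic sequence) and compares the running average with $E[X_1]$. The transition probabilities `p01`, `p10` and the seed are arbitrary choices.

```python
import random

def ergodic_average(n, p01=0.3, p10=0.2, seed=0):
    """Time average of a stationary two-state Markov chain on {0, 1}.

    p01 = P(next = 1 | current = 0), p10 = P(next = 0 | current = 1).
    Returns (average of X_1..X_n, stationary mean E[X_1]).
    """
    rng = random.Random(seed)
    pi1 = p01 / (p01 + p10)               # stationary probability of state 1
    x = 1 if rng.random() < pi1 else 0    # start from the stationary law
    total = 0
    for _ in range(n):
        total += x
        if x == 0:
            x = 1 if rng.random() < p01 else 0
        else:
            x = 0 if rng.random() < p10 else 1
    return total / n, pi1

avg, mean = ergodic_average(200_000)
print(avg, mean)   # the two numbers should be close
```

The variables here are far from independent (each state tends to persist), yet the long-run average still settles near the expected value.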
1.10 Exercises


1. Given a sigma field $\mathcal{F}$, if $A_i \in \mathcal{F}$ for all $1 \le i \le n$, is
$\cap_{i=1}^{n} A_i \in \mathcal{F}$?

2. Suppose $\mathcal{F}_i$, $i = 1, 2, 3, \ldots$ are sigma fields. (a) Is $\cap_{i=1}^{\infty} \mathcal{F}_i$
necessarily always a sigma field? Explain. (b) Does your reasoning
in (a) also apply to the intersection of an uncountable number
of sigma fields? (c) Is $\cup_{i=1}^{\infty} \mathcal{F}_i$ necessarily always a sigma field?
Explain.

3. (a) Suppose $\Omega = \{1, 2, \ldots, n\}$. How many different sets will
there be in the sigma field generated by starting with the individual
elements in $\Omega$? (b) Is it possible for a sigma field to have
a countably infinite number of different sets in it? Explain.

4. Show that if $X$ and $Y$ are real-valued random variables measurable
with respect to some given sigma field, then so is $XY$
with respect to the same sigma field.

5. If $X$ is a random variable, is it possible for the cdf $F(x) = P(X \le x)$
to be discontinuous at a countably infinite number
of values of $x$? Is it possible for it to be discontinuous at an
uncountably infinite number of values of $x$? Explain.

6. Show that $E[X] = \sum_i x_i P(X = x_i)$ if $X$ can only take a countably
infinite number of different possible values.

7. Prove that if $X \ge 0$ and $E[X] < \infty$, then $\lim_{n \to \infty} E[X I_{X > n}] = 0$.

8. Assume $X \ge 0$ is a random variable, but don’t necessarily
assume that $E[1/X] < \infty$. Show that $\lim_{n \to \infty} E[\frac{n}{X} I_{X > n}] = 0$
and $\lim_{n \to \infty} E[\frac{1}{nX} I_{X > n}] = 0$.

9. Use the definition of expected value in terms of simple variables
to prove that if $X \ge 0$ and $E[X] = 0$ then $X = 0$ almost surely.

10. Show that if $X_n \to_d c$ then $X_n \to_p c$.

11. Show that if $E[g(X_n)] \to E[g(X)]$ for all bounded, continuous
functions $g$, then $X_n \to_d X$.


12. If $X_1, X_2, \ldots$ are non-negative random variables with the same
distribution (but the variables are not necessarily independent)
and $E[X_1] < \infty$, prove that $\lim_{n \to \infty} E[\max_{i \le n} X_i / n] = 0$.

13. For random variables $X_1, X_2, \ldots$, let $\mathcal{T}$ be the tail sigma field
and let $S_n = \sum_{i=1}^{n} X_i$. (a) Is $\{\lim_{n \to \infty} S_n/n > 0\} \in \mathcal{T}$? (b) Is
$\{\lim_{n \to \infty} S_n > 0\} \in \mathcal{T}$?

14. If $X_1, X_2, \ldots$ are non-negative iid rv’s with $P(X_i > 0) > 0$, show
that $P(\sum_{i=1}^{\infty} X_i = \infty) = 1$.

15. Suppose $X_1, X_2, \ldots$ are continuous iid rv’s, and

$$Y_n = I_{\{X_n > \max_{i < n} X_i\}}.$$

(a) Argue that $Y_i$ is independent of $Y_j$, for $i \ne j$. (b) What is
$P(\sum_{i=1}^{\infty} Y_i < \infty)$? (c) What is $P(\sum_{i=1}^{\infty} Y_i Y_{i+1} < \infty)$?

16. Suppose there is a single server and the $i$th customer to arrive
requires the server spend $U_i$ of time serving him, and
the time between his arrival and the next customer’s arrival
is $V_i$, and that $X_i = U_i - V_i$ are iid with mean $\mu$. (a) If
$Q_{n+1}$ is the amount of time the $(n+1)$st customer must wait
before being served, explain why $Q_{n+1} = \max(Q_n + X_n, 0)
= \max(0, X_n, X_n + X_{n-1}, \ldots, X_n + \cdots + X_1)$. (b) Show
$P(Q_n \to \infty) = 1$ if $\mu > 0$.

17. Given a nonnegative random variable $X$ define the sequence
of random variables $Y_n = \min(\lfloor 2^n X \rfloor / 2^n, n)$, where $\lfloor x \rfloor$
denotes the integer portion of $x$. Show that $Y_n \uparrow X$ and
$E[X] = \lim_n E[Y_n]$.

18. Show that for any monotone functions $f$ and $g$ if $X, Y$ are
independent random variables then so are $f(X), g(Y)$.

19. Let $X_1, X_2, \ldots$ be random variables with $X_i < \infty$ and suppose
$\sum_n P(X_n > 1) < \infty$. Compute $P(\sup_n X_n < \infty)$.

20. Suppose $X_n \to_p X$ and there is a random variable $Y$ with
$E[Y] < \infty$ such that $|X_n| < Y$ for all $n$. Show
$E[\lim_{n \to \infty} X_n] = \lim_{n \to \infty} E[X_n]$.
21. For random variables $X_1, X_2, \ldots$, let $\mathcal{T}$ and $\mathcal{I}$ respectively be the
set of tail events and the set of invariant events. Show that $\mathcal{I}$
and $\mathcal{T}$ are both sigma fields.

22. A ring is hanging from the ceiling by a string. Someone will
cut the ring in two positions chosen uniformly at random on
the circumference, and this will break the ring into two pieces.
Player I gets the piece which falls to the floor, and player II
gets the piece which stays attached to the string. Whoever gets
the bigger piece wins. Does either player have a big advantage
here? Explain.

23. A box contains four marbles. One marble is red, and each
of the other three marbles is either yellow or green, but you
have no idea exactly how many of each color there are, or even
if the other three marbles are all the same color or not. (a)
Someone chooses one marble at random from the box and if
you can correctly guess the color you will win $1,000. What
color would you guess? Explain. (b) If this game is going to
be played four times using the same box of marbles (and in
between replacing the marble drawn each time), what guesses
would you make if you had to make all four guesses ahead of
time? Explain.


Chapter 2 


Stein’s Method and Central 
Limit Theorems 


2.1 Introduction 


You are probably familiar with the central limit theorem, saying that
the sum of a large number of independent random variables follows
roughly a normal distribution. Most proofs presented for this celebrated
result generally involve properties of the characteristic function
$\phi(t) = E[e^{itX}]$ for a random variable $X$, the proofs of which are
non-probabilistic and often somewhat mysterious to the uninitiated.

One goal of this chapter is to present a beautiful alternative proof
of a version of the central limit theorem using a powerful technique
called Stein’s method. This technique also amazingly can be applied
in settings with dependent variables, and gives an explicit bound on
the error of the normal approximation; such a bound is quite difficult
to derive using other methods. The technique also can be applied
to other distributions, the Poisson and geometric distributions included.
We first embark on a brief tour of Stein’s method applied in
the relatively simpler settings of the Poisson and geometric distributions,
and then move to the normal distribution. As a first step we
introduce the concept of a coupling, one of the key ingredients we
need.

In section 2 we introduce the concept of coupling, and show how it
can be used to bound the error when approximating one distribution
with another distribution, and in section 3 we prove a theorem by
Le Cam giving a bound on the error of the Poisson approximation
for independent events. In section 4 we introduce the Stein-Chen
method, which can give bounds on the error of the Poisson approximation
for events with dependencies, and in section 5 we illustrate
how the method can be adapted to the setting of the geometric distribution.
In section 6, we demonstrate Stein’s method applied to
the normal distribution, obtain a bound on the error of the normal
approximation for the sum of independent variables, and use this to
prove a version of the central limit theorem.


2.2 Coupling 


One of the most interesting properties of expected value is that
$E[X - Y] = E[X] - E[Y]$ even if the variables $X$ and $Y$ are highly dependent
on each other. A useful strategy for estimating $E[X] - E[Y]$ is to
create a dependency between $X$ and $Y$ which simplifies estimating
$E[X - Y]$. Such a dependency between two random variables is called
a coupling.


Definition 2.1 The pair $(\hat{X}, \hat{Y})$ is a “coupling” of the random variables
$(X, Y)$ if $\hat{X} =_d X$ and $\hat{Y} =_d Y$.


Example 2.2 Suppose X,Y and U are U(0,1) random variables. 
Then both (U,U) and (U,1—U) are couplings of (X,Y). 


A random variable $X$ is said to be “stochastically smaller” than
$Y$, also written as $X \le_{st} Y$, if

$$P(X \le x) \ge P(Y \le x), \quad \forall x,$$

and under this condition we can create a coupling where one variable
is always less than the other.


Proposition 2.3 If $X \le_{st} Y$ it is possible to construct a coupling
$(\hat{X}, \hat{Y})$ of $(X, Y)$ with $\hat{X} \le \hat{Y}$ almost surely.


Proof. With $F(t) = P(X \le t)$, $G(t) = P(Y \le t)$ and
$F^{-1}(x) = \inf\{t : F(t) \ge x\}$ and $G^{-1}(x) = \inf\{t : G(t) \ge x\}$, let $U \sim U(0, 1)$,
$\hat{X} = F^{-1}(U)$, and $\hat{Y} = G^{-1}(U)$. Since $F(t) \ge G(t)$ implies
$F^{-1}(x) \le G^{-1}(x)$, we have $\hat{X} \le \hat{Y}$. And since

$$\inf\{t : F(t) \ge U\} \le x \iff F(x) \ge U$$

implies

$$P(F^{-1}(U) \le x) = P(F(x) \ge U) = F(x),$$

we get $\hat{X} =_d X$ and $\hat{Y} =_d Y$ after applying the same argument to
$G$. ∎
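The construction in this proof (feed one shared uniform through both inverse distribution functions) can be tried numerically. The following is a minimal sketch under an assumed example not taken from the text: $X$ exponential with rate 2 and $Y$ exponential with rate 1, so that $X \le_{st} Y$. The function names are illustrative only.

```python
import math
import random

def quantile_exp(u, rate):
    """Inverse cdf of an Exponential(rate) law evaluated at u in [0, 1)."""
    return -math.log(1.0 - u) / rate

def coupled_pair(rng, rate_x=2.0, rate_y=1.0):
    """Return (X_hat, Y_hat) built from one shared uniform U."""
    u = rng.random()
    return quantile_exp(u, rate_x), quantile_exp(u, rate_y)

rng = random.Random(1)
pairs = [coupled_pair(rng) for _ in range(10_000)]
always_ordered = all(x <= y for x, y in pairs)
mean_x = sum(x for x, _ in pairs) / len(pairs)   # should be near 1/2
mean_y = sum(y for _, y in pairs) / len(pairs)   # should be near 1
print(always_ordered, mean_x, mean_y)
```

Each coordinate keeps its own marginal distribution, while the shared uniform forces the ordering on every draw.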


Example 2.4 If $X \sim N(0, 1)$ and $Y \sim N(1, 1)$, then $X \le_{st} Y$. To
show this, note that $(\hat{X}, \hat{Y}) = (X, 1 + X)$ is a coupling of $(X, Y)$.
Since $\hat{X} \le \hat{Y}$ we must have $\hat{X} \le_{st} \hat{Y}$ and thus $X \le_{st} Y$.


Even though the probability two independent continuous random
variables exactly equal each other is always zero, it is possible to couple
two variables with completely different density functions so that
they equal each other with very high probability. For the random
variables $(X, Y)$, the coupling $(\hat{X}, \hat{Y})$ is called a maximal coupling
if $P(\hat{X} = \hat{Y})$ is as large as possible. We next show how large this
probability can be, and how to create a maximal coupling.


Proposition 2.5 Suppose $X$ and $Y$ are random variables with respective
piecewise continuous density functions $f$ and $g$. The maximal
coupling $(\hat{X}, \hat{Y})$ for $(X, Y)$ has

$$P(\hat{X} = \hat{Y}) = \int_{-\infty}^{\infty} \min(f(x), g(x)) \, dx.$$


Proof. Letting $p = \int_{-\infty}^{\infty} \min(f(x), g(x)) \, dx$ and $A = \{x : f(x) < g(x)\}$,
note that any coupling $(\hat{X}, \hat{Y})$ of $(X, Y)$ must satisfy

$$\begin{aligned}
P(\hat{X} = \hat{Y}) &= P(\hat{X} = \hat{Y} \in A) + P(\hat{X} = \hat{Y} \in A^c) \\
&\le P(X \in A) + P(Y \in A^c) \\
&= \int_{A} f(x) \, dx + \int_{A^c} g(x) \, dx \\
&= p.
\end{aligned}$$

We use the fact that $f, g$ are piecewise continuous to justify that
the integrals above (and in the next step below) are well defined.
Next we construct a coupling with $P(\hat{X} = \hat{Y}) \ge p$ which, in light
of the previous inequality, must therefore be the maximal coupling.
Let $B$, $C$, and $D$ be independent random variables with respective
density functions

$$b(x) = \frac{\min(f(x), g(x))}{p},$$

$$c(x) = \frac{f(x) - \min(f(x), g(x))}{1 - p},$$

and

$$d(x) = \frac{g(x) - \min(f(x), g(x))}{1 - p},$$

let $I \sim \text{Bernoulli}(p)$, and if $I = 1$ then let $\hat{X} = \hat{Y} = B$ and otherwise
let $\hat{X} = C$ and $\hat{Y} = D$. This clearly gives $P(\hat{X} = \hat{Y}) \ge P(I = 1) = p$, and

$$\begin{aligned}
P(\hat{X} \le x) &= P(\hat{X} \le x \mid I = 1) p + P(\hat{X} \le x \mid I = 0)(1 - p) \\
&= p \int_{-\infty}^{x} b(x) \, dx + (1 - p) \int_{-\infty}^{x} c(x) \, dx \\
&= \int_{-\infty}^{x} f(x) \, dx,
\end{aligned}$$

and the same argument again gives $P(\hat{Y} \le x) = P(Y \le x)$. ∎

Using this result, it’s easy to see how to create a maximal cou- 
pling for two discrete random variables. 


Corollary 2.6 Suppose $X$ and $Y$ are discrete random variables each
taking values in a set $A$, and have respective probability mass functions
$f(x) = P(X = x)$ and $g(x) = P(Y = x)$. The maximal coupling
of $(X, Y)$ has

$$P(\hat{X} = \hat{Y}) = \sum_{x} \min(g(x), f(x)).$$


Proof. First in a case where $X$ and $Y$ are integer valued, we can
apply the previous proposition to the continuous variables $X + U$ and
$Y + U$, where $U \sim U(0, 1)$, to get the result. And since any discrete
random variable can be expressed as a function of an integer-valued
random variable, the result for general discrete random variables will
hold. ∎
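For discrete variables the construction from the proof of Proposition 2.5 can be carried out directly on probability mass functions. The sketch below assumes pmfs are given as Python dictionaries and uses two made-up pmfs `f` and `g`; the helper names are illustrative and not part of the text.

```python
import random

def overlap(f, g):
    """p = sum over x of min(f(x), g(x)) for pmfs given as dicts."""
    return sum(min(f.get(x, 0.0), g.get(x, 0.0)) for x in set(f) | set(g))

def sample_from(weights, rng):
    """Draw a key with probability proportional to its nonnegative weight."""
    keys = list(weights)
    return rng.choices(keys, weights=[weights[k] for k in keys])[0]

def maximal_coupling(f, g, rng):
    """One draw (X_hat, Y_hat), mirroring the B, C, D, I construction."""
    support = set(f) | set(g)
    common = {x: min(f.get(x, 0.0), g.get(x, 0.0)) for x in support}
    p = sum(common.values())
    if rng.random() < p:                  # I = 1: both coordinates equal B
        b = sample_from(common, rng)
        return b, b
    # I = 0: X_hat ~ C and Y_hat ~ D, drawn independently
    c = {x: f.get(x, 0.0) - common[x] for x in support}
    d = {x: g.get(x, 0.0) - common[x] for x in support}
    return sample_from(c, rng), sample_from(d, rng)

f = {0: 0.5, 1: 0.3, 2: 0.2}   # hypothetical pmf of X
g = {0: 0.2, 1: 0.3, 2: 0.5}   # hypothetical pmf of Y
rng = random.Random(2)
draws = [maximal_coupling(f, g, rng) for _ in range(20_000)]
match_rate = sum(x == y for x, y in draws) / len(draws)
print(overlap(f, g), match_rate)   # the match rate should be close to the overlap
```

Sampling by unnormalized weights avoids dividing by $p$ or $1 - p$, so the degenerate cases $p = 0$ and $p = 1$ never reach a branch with all-zero weights.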


There is a relationship between how closely two variables can be 
coupled and how close they are in distribution. One common measure 
of distance between the distributions of two random variables X and 
Y is called total variation distance, defined as 


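$$d_{TV}(X, Y) = \sup_{A} |P(X \in A) - P(Y \in A)|.$$

We next show the link between total variation distance and couplings.

Proposition 2.7 If $(\hat{X}, \hat{Y})$ is a maximal coupling for $(X, Y)$, then
$d_{TV}(X, Y) = P(\hat{X} \ne \hat{Y})$.


Proof. The result will be proven under the assumption that $X, Y$
are continuous with respective density functions $f, g$. Letting
$A = \{x : f(x) > g(x)\}$, we must have

$$\begin{aligned}
d_{TV}(X, Y) &= \max\{P(X \in A) - P(Y \in A), \; P(Y \in A^c) - P(X \in A^c)\} \\
&= \max\{P(X \in A) - P(Y \in A), \; 1 - P(Y \in A) - 1 + P(X \in A)\} \\
&= P(X \in A) - P(Y \in A)
\end{aligned}$$

and so

$$\begin{aligned}
P(\hat{X} \ne \hat{Y}) &= 1 - \int_{-\infty}^{\infty} \min(f(x), g(x)) \, dx \\
&= 1 - \int_{A} g(x) \, dx - \int_{A^c} f(x) \, dx \\
&= 1 - P(Y \in A) - 1 + P(X \in A) \\
&= d_{TV}(X, Y). \; ∎
\end{aligned}$$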
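The proof above treats the continuous case. As an illustration of the discrete analogue (assumed here, not proved in this passage), the sketch below computes the supremum in the definition of $d_{TV}$ by brute force over all events and compares it with one minus the overlap $\sum_x \min(f(x), g(x))$. The pmfs are made up and exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import combinations

def tv_by_sup(f, g):
    """sup over all events A of |P(X in A) - P(Y in A)|, by enumeration."""
    support = sorted(set(f) | set(g))
    best = Fraction(0)
    for r in range(len(support) + 1):
        for event in combinations(support, r):
            diff = sum(f.get(x, 0) - g.get(x, 0) for x in event)
            best = max(best, abs(diff))
    return best

def tv_by_overlap(f, g):
    """1 - sum_x min(f(x), g(x)): P(X_hat != Y_hat) under a maximal coupling."""
    support = set(f) | set(g)
    return 1 - sum(min(f.get(x, 0), g.get(x, 0)) for x in support)

f = {0: Fraction(1, 2), 1: Fraction(3, 10), 2: Fraction(1, 5)}
g = {0: Fraction(1, 5), 1: Fraction(3, 10), 2: Fraction(1, 2)}
print(tv_by_sup(f, g), tv_by_overlap(f, g))
```

The enumeration is exponential in the support size, so it is only meant as a check on small examples.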
2.3 Poisson Approximation and Le Cam’s Theorem


It’s well-known that a binomial distribution converges to a Poisson 
distribution when the number of trials is increased and the prob- 
ability of success is decreased at the same time in such a way so 
that the mean stays constant. This also motivates using a Poisson 
distribution as an approximation for a binomial distribution if the 
probability of success is small and the number of trials is large. If the 
number of trials is very large, computing the distribution function 
of a binomial distribution can be computationally difficult, whereas 
the Poisson approximation may be much easier. 

A fact which is not quite as well known is that the Poisson dis- 
tribution can be a reasonable approximation even if the trials have 
varying probabilities and even if the trials are not completely inde- 
pendent of each other. This approximation is interesting because 
dependent trials are notoriously difficult to analyze in general, and 
the Poisson distribution is elementary. 

It’s possible to assess how accurate such Poisson approximations
are, and we first give a bound on the error of the Poisson approximation
for completely independent trials with different probabilities
of success.


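Proposition 2.8 Let $X_i$ be independent Bernoulli($p_i$), and let
$W = \sum_{i=1}^{n} X_i$, $Z \sim$ Poisson($\lambda$) and $\lambda = E[W] = \sum_{i=1}^{n} p_i$. Then

$$d_{TV}(W, Z) \le \sum_{i=1}^{n} p_i^2.$$


Proof. We first write $Z = \sum_{i=1}^{n} Z_i$, where the $Z_i$ are independent
and $Z_i \sim$ Poisson($p_i$). Then we create the maximal coupling $(\hat{Z}_i, \hat{X}_i)$
of $(Z_i, X_i)$ and use the previous corollary to get

$$\begin{aligned}
P(\hat{Z}_i = \hat{X}_i) &= \sum_{k=0}^{\infty} \min(P(X_i = k), P(Z_i = k)) \\
&= \min(1 - p_i, e^{-p_i}) + \min(p_i, p_i e^{-p_i}) \\
&= 1 - p_i + p_i e^{-p_i} \\
&\ge 1 - p_i^2,
\end{aligned}$$

where we use $e^{-x} \ge 1 - x$. Using that $(\sum_{i=1}^{n} \hat{Z}_i, \sum_{i=1}^{n} \hat{X}_i)$ is a
coupling of $(Z, W)$ yields that

$$\begin{aligned}
d_{TV}(W, Z) &\le P\Big(\sum_{i=1}^{n} \hat{Z}_i \ne \sum_{i=1}^{n} \hat{X}_i\Big) \\
&\le P(\cup_i \{\hat{Z}_i \ne \hat{X}_i\}) \\
&\le \sum_{i} P(\hat{Z}_i \ne \hat{X}_i) \\
&\le \sum_{i=1}^{n} p_i^2. \; ∎
\end{aligned}$$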

Remark 2.9 The drawback to this beautiful result is that when $E[W]$
is large the upper bound could be much greater than 1 even if $W$
has approximately a Poisson distribution. For example if $W$ is a
binomial (100, .1) random variable and $Z \sim$ Poisson(10) we should
have $W \approx_d Z$ but the proposition gives us the trivial and useless
result $d_{TV}(W, Z) \le 100 \times (.1)^2 = 1$.
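To see how loose the bound is in this example, the total variation distance can be computed numerically. The sketch below assumes the identity $d_{TV}(X, Y) = \frac{1}{2} \sum_x |P(X = x) - P(Y = x)|$ for discrete variables (stated as an exercise at the end of this chapter) and truncates the Poisson sum at an arbitrary cutoff where the remaining mass is negligible.

```python
import math

def binom_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    # computed on the log scale to avoid overflow for large k
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def tv_binomial_poisson(n, p, cutoff=400):
    """Half the L1 distance between the Binomial(n, p) and Poisson(n p) pmfs,
    summed over 0..cutoff."""
    lam = n * p
    total = 0.0
    for k in range(cutoff + 1):
        b = binom_pmf(k, n, p) if k <= n else 0.0
        total += abs(b - poisson_pmf(k, lam))
    return total / 2

n, p = 100, 0.1
le_cam = n * p**2                   # the bound of Proposition 2.8
exact = tv_binomial_poisson(n, p)   # numerically computed distance
print(le_cam, exact)
```

The computed distance comes out far below 1, which is the sense in which the bound is uninformative here.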


2.4 The Stein-Chen Method 


The Stein-Chen method is another approach for obtaining an upper
bound on $d_{TV}(W, Z)$, where $Z$ is a Poisson random variable and $W$
is some other variable you are interested in. This approach covers
the distribution of the number of successes in dependent as well as
independent trials with varying probabilities of success.

In order for the bound to be good, it should be close to 0 when $W$
is close in distribution to $Z$. In order to achieve this, the Stein-Chen
method uses the interesting property of the Poisson distribution

$$k P(Z = k) = \lambda P(Z = k - 1). \qquad (2.1)$$

Thus, for any bounded function $f$ with $f(0) = 0$,

$$\begin{aligned}
E[Z f(Z)] &= \sum_{k=0}^{\infty} k f(k) P(Z = k) \\
&= \lambda \sum_{k=1}^{\infty} f(k) P(Z = k - 1) \\
&= \lambda \sum_{i=0}^{\infty} f(i + 1) P(Z = i) \\
&= \lambda E[f(Z + 1)].
\end{aligned}$$

The secret to using the preceding is in cleverly picking a function
$f$ so that $d_{TV}(W, Z) \le E[W f(W)] - \lambda E[f(W + 1)]$, and so we are
likely to get a very small upper bound when $W \approx_d Z$ (where we use
the notation $W \approx_d Z$ to mean that $W$ and $Z$ approximately have
the same distribution).
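The identity $E[Z f(Z)] = \lambda E[f(Z + 1)]$ is easy to check numerically. The sketch below uses an arbitrary bounded test function with $f(0) = 0$ and truncates both sums at a cutoff far into the Poisson tail; the function and parameter choices are illustrative assumptions.

```python
import math

def poisson_pmf(k, lam):
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def stein_identity_sides(f, lam, cutoff=200):
    """Return (E[Z f(Z)], lam * E[f(Z + 1)]) with sums truncated at cutoff."""
    lhs = sum(k * f(k) * poisson_pmf(k, lam) for k in range(cutoff + 1))
    rhs = lam * sum(f(k + 1) * poisson_pmf(k, lam) for k in range(cutoff + 1))
    return lhs, rhs

def f(k):
    # a bounded test function with f(0) = 0
    return math.sin(k) / (1 + k)

lhs, rhs = stein_identity_sides(f, lam=3.0)
print(lhs, rhs)   # the two sides agree up to rounding and truncation error
```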


Suppose $Z \sim$ Poisson($\lambda$) and $A$ is any set of nonnegative integers.
Define the function $f_A(k)$, $k = 0, 1, 2, \ldots$, starting with $f_A(0) = 0$ and
then using the following “Stein equation” for the Poisson distribution:

$$\lambda f_A(k + 1) - k f_A(k) = I_{k \in A} - P(Z \in A). \qquad (2.2)$$

Notice that by plugging in any random variable $W$ for $k$ and
taking expected values we get

$$\lambda E[f_A(W + 1)] - E[W f_A(W)] = P(W \in A) - P(Z \in A),$$

so that

$$d_{TV}(W, Z) \le \sup_{A} |\lambda E[f_A(W + 1)] - E[W f_A(W)]|.$$


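Lemma 2.10 For any $A$ and $i, j$, $|f_A(i) - f_A(j)| \le \min(1, 1/\lambda) |i - j|$.


Proof. The solution to (2.2) is

$$f_A(k + 1) = \sum_{j \in A} \frac{I_{j \le k} - P(Z \le k)}{\lambda P(Z = k) / P(Z = j)}$$

because when we plug it in to the left hand side of (2.2) and use (2.1)
in the second line below, we get

$$\begin{aligned}
\lambda f_A(k + 1) - k f_A(k)
&= \sum_{j \in A} \left[ \frac{I_{j \le k} - P(Z \le k)}{P(Z = k) / P(Z = j)} - \frac{k \, (I_{j \le k-1} - P(Z \le k - 1))}{\lambda P(Z = k - 1) / P(Z = j)} \right] \\
&= \sum_{j \in A} \frac{I_{j = k} - P(Z = k)}{P(Z = k) / P(Z = j)} \\
&= I_{k \in A} - P(Z \in A),
\end{aligned}$$

which is the right hand side of (2.2).

Since

$$\frac{P(Z \le k)}{P(Z = k)} = \sum_{i \le k} \frac{k! \, \lambda^{i-k}}{i!} = \sum_{i \le k} \frac{k! \, \lambda^{-i}}{(k - i)!}$$

is increasing in $k$ and

$$\frac{1 - P(Z \le k)}{P(Z = k)} = \sum_{i > k} \frac{k! \, \lambda^{i-k}}{i!} = \sum_{i > 0} \frac{k! \, \lambda^{i}}{(i + k)!}$$

is decreasing in $k$, we see that $f_{\{j\}}(k + 1) \le f_{\{j\}}(k)$ when $j \ne k$ and
thus

$$\begin{aligned}
f_A(k + 1) - f_A(k) &= \sum_{j \in A} f_{\{j\}}(k + 1) - f_{\{j\}}(k) \\
&\le \sum_{j \in A, \, j = k} f_{\{j\}}(k + 1) - f_{\{j\}}(k) \\
&\le \frac{P(Z > k)}{\lambda} + \frac{P(Z \le k - 1)}{\lambda P(Z = k - 1) / P(Z = k)} \\
&= \frac{P(Z > k)}{\lambda} + \frac{P(Z \le k - 1)}{k} \\
&\le \frac{P(Z > k)}{\lambda} + \frac{P(0 < Z \le k)}{\lambda} \\
&\le \frac{1 - e^{-\lambda}}{\lambda} \\
&\le \min(1, 1/\lambda)
\end{aligned}$$

and

$$-f_A(k + 1) + f_A(k) = f_{A^c}(k + 1) - f_{A^c}(k) \le \min(1, 1/\lambda),$$

which together give

$$|f_A(k + 1) - f_A(k)| \le \min(1, 1/\lambda).$$

The final result is proved using

$$|f_A(i) - f_A(j)| \le \sum_{k = \min(i,j)}^{\max(i,j) - 1} |f_A(k + 1) - f_A(k)| \le |i - j| \min(1, 1/\lambda). \; ∎$$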

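Theorem 2.11 Suppose $W = \sum_{i=1}^{n} X_i$, where $X_i$ are indicator variables
with $P(X_i = 1) = \lambda_i$ and $\lambda = \sum_{i=1}^{n} \lambda_i$. Letting $Z \sim$ Poisson($\lambda$)
and $V_i =_d (W - 1 \mid X_i = 1)$, we have

$$d_{TV}(W, Z) \le \min(1, 1/\lambda) \sum_{i=1}^{n} \lambda_i E|W - V_i|.$$


Proof.

$$\begin{aligned}
d_{TV}(W, Z) &\le \sup_{A} |\lambda E[f_A(W + 1)] - E[W f_A(W)]| \\
&\le \sup_{A} \Big| \sum_{i=1}^{n} E[\lambda_i f_A(W + 1) - X_i f_A(W)] \Big| \\
&\le \sup_{A} \sum_{i=1}^{n} \lambda_i E|f_A(W + 1) - f_A(V_i + 1)| \\
&\le \min(1, 1/\lambda) \sum_{i=1}^{n} \lambda_i E|W - V_i|,
\end{aligned}$$

where $V_i$ is any random variable having distribution
$V_i =_d (W - 1 \mid X_i = 1)$. ∎


Proposition 2.12 With the above notation if $W \ge V_i$ for all $i$, then
we have

$$d_{TV}(W, Z) \le 1 - \text{Var}(W)/E[W].$$

Proof. Using $W \ge V_i$ we have

$$\begin{aligned}
\sum_{i=1}^{n} \lambda_i E|W - V_i| &= \sum_{i=1}^{n} \lambda_i E[W - V_i] \\
&= \lambda^2 + \lambda - \sum_{i=1}^{n} \lambda_i E[1 + V_i] \\
&= \lambda^2 + \lambda - \sum_{i=1}^{n} \lambda_i E[W \mid X_i = 1] \\
&= \lambda^2 + \lambda - \sum_{i=1}^{n} E[X_i W] \\
&= \lambda - \text{Var}(W),
\end{aligned}$$

and the preceding theorem along with $\lambda = E[W]$ gives

$$\begin{aligned}
d_{TV}(W, Z) &\le \min(1, 1/E[W]) (E[W] - \text{Var}(W)) \\
&\le 1 - \text{Var}(W)/E[W]. \; ∎
\end{aligned}$$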

Example 2.13 Let $X_i$ be independent Bernoulli($p_i$) random variables
with $\lambda = \sum_{i=1}^{n} p_i$. Let $W = \sum_{i=1}^{n} X_i$, and let $Z \sim$ Poisson($\lambda$).
Using $V_i = \sum_{j \ne i} X_j$, note that $W \ge V_i$ and $E[W - V_i] = p_i$, so the
preceding theorem gives us

$$d_{TV}(W, Z) \le \min(1, 1/\lambda) \sum_{i=1}^{n} p_i^2.$$

For instance, if $X$ is a binomial random variable with parameters
$n = 100$, $p = 1/10$, then the upper bound on the total variation
distance between $X$ and a Poisson random variable with mean 10
given by the Stein-Chen method is 1/10, as opposed to the upper
bound of 1 which results from the Le Cam method of the preceding
section.
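The two bounds are simple enough to compare side by side for any list of success probabilities. This is a small sketch with illustrative function names; it assumes the list is nonempty with positive entries so that $\lambda > 0$.

```python
def le_cam_bound(ps):
    """Bound of Proposition 2.8: the sum of p_i squared."""
    return sum(p * p for p in ps)

def stein_chen_bound(ps):
    """Bound of Example 2.13: min(1, 1/lambda) times the sum of p_i squared,
    where lambda is the sum of the p_i (assumed positive)."""
    lam = sum(ps)
    return min(1.0, 1.0 / lam) * sum(p * p for p in ps)

binomial_case = [0.1] * 100   # Binomial(100, 1/10) written as a sum of indicators
print(le_cam_bound(binomial_case), stein_chen_bound(binomial_case))
```

The extra factor $\min(1, 1/\lambda)$ is what keeps the second bound informative when $E[W]$ is large.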


Example 2.14 A coin with probability $p = 1 - q$ of coming up heads
is flipped $n + k$ times. We are interested in $P(R_k)$, where $R_k$ is the
event that a run of at least $k$ heads in a row occurs. To approximate
this probability, whose exact expression is given in Example 1.10, let
$X_i = 1$ if flip $i$ lands tails and flips $i + 1, \ldots, i + k$ all land heads, and
let $X_i = 0$ otherwise, $i = 1, \ldots, n$. Let

$$W = \sum_{i=1}^{n} X_i$$

and note that there will be a run of $k$ heads either if $W > 0$ or if the
first $k$ flips all land heads. Consequently,

$$P(W > 0) \le P(R_k) \le P(W > 0) + p^k.$$

Because the flips are independent and $X_i = 1$ implies $X_j = 0$ for all
$j \ne i$ where $|i - j| \le k$, it follows that if we let

$$V_i = W - \sum_{j = i - k}^{i + k} X_j$$

then $V_i =_d (W - 1 \mid X_i = 1)$ and $W \ge V_i$. Using that $\lambda = E[W] = n q p^k$
and $E[W - V_i] = (2k + 1) q p^k$, we see that

$$d_{TV}(W, Z) \le \min(1, 1/\lambda) \, n (2k + 1) q^2 p^{2k},$$

where $Z \sim$ Poisson($\lambda$). For instance, suppose we flip a fair coin 1034
times, and want to approximate the probability that we have a run
of 10 heads in a row. In this case, $n = 1024$, $k = 10$, $p = 1/2$, and so
$\lambda = 1024/2^{11} = .5$. Consequently,

$$P(W > 0) \approx 1 - e^{-.5}$$

with the error in the above approximation being at most $21/2^{12}$.
Consequently, we obtain that

$$1 - e^{-.5} - 21/2^{12} \le P(R_{10}) \le 1 - e^{-.5} + 21/2^{12} + (1/2)^{10}$$

or

$$0.388 \le P(R_{10}) \le 0.4$$

Example 2.15 Birthday Problem. With $m$ people and $n$ days in the
year, let $Y_i$ = number of people born on day $i$. Let $X_i = I_{Y_i = 0}$,
and $W = \sum_{i=1}^{n} X_i$ equal the number of days on which nobody has a
birthday.

Next imagine $n$ different hypothetical scenarios are constructed,
where in scenario $i$ all the $Y_i$ people initially born on day $i$ have their
birthdays reassigned randomly to other days. Let $1 + V_i$ equal the
number of days under scenario $i$ on which nobody has a birthday.

Notice that $W \ge V_i$ and $V_i =_d (W - 1 \mid X_i = 1)$. Also we have
$\lambda = E[W] = n(1 - 1/n)^m$ and $E[V_i] = (n - 1)(1 - 1/(n - 1))^m$, so
with $Z \sim$ Poisson($\lambda$) we get

$$d_{TV}(W, Z) \le \min(1, 1/\lambda) \, n \big( n(1 - 1/n)^m - (n - 1)(1 - 1/(n - 1))^m \big).$$


2.5 Stein’s Method for the Geometric Distribution

In this section we will show how to obtain an upper bound on the
distance $d_{TV}(W, Z)$ where $W$ is a given random variable and $Z$ has
a geometric distribution with parameter $p = 1 - q = P(W = 1) = P(Z = 1)$.
We will use a version of Stein’s method applied to the
geometric distribution. Define $f_A(1) = 0$ and, for $k = 1, 2, \ldots$, define
$f_A(k + 1)$ using the recursion

$$f_A(k) - q f_A(k + 1) = I_{k \in A} - P(Z \in A).$$

Lemma 2.16 We have $|f_A(i) - f_A(j)| \le 1/p$.

Proof. It’s easy to check that the solution is

$$f_A(k) = P(Z \in A, Z \ge k)/P(Z = k) - P(Z \in A)/p$$

and, since neither of the two terms in the difference above can be
larger than $1/p$, the lemma follows. ∎


Theorem 2.17 Given random variables $W$ and $V$ such that
$V =_d (W - 1 \mid W > 1)$, let $p = P(W = 1)$ and $Z \sim$ geometric($p$). Then

$$d_{TV}(W, Z) \le q p^{-1} P(W \ne V).$$

Proof.

$$\begin{aligned}
|P(W \in A) - P(Z \in A)| &= |E[f_A(W) - q f_A(W + 1)]| \\
&= |q E[f_A(W) \mid W > 1] - q E[f_A(W + 1)]| \\
&\le q E|f_A(1 + V) - f_A(1 + W)| \\
&\le q p^{-1} P(W \ne V),
\end{aligned}$$

where the last inequality follows from the preceding lemma. ∎


Example 2.18 Coin flipping. A coin has probability $p = 1 - q$ of
coming up “heads” with each flip, and let $X_i = 1$ if the $i$th flip is
heads and $X_i = 0$ if it is tails. We are interested in the distribution of
the number of flips $M$ required until the start of the first appearance
of a run of $k$ heads in a row. Suppose in particular we are interested
in estimating $P(M \in A)$ for some set of integers $A$.

To do this, instead let $N$ be the number of flips required until the
start of the first appearance of a run of tails followed by $k$ heads in
a row. For instance, if $k = 3$, and the flip sequence is HHTTHHH,
then $M = 4$ and $N = 5$. Note that in general we will have $N + 1 = M$
if $M > 1$ so consequently,

$$d_{TV}(M, N + 1) \le P(N + 1 \ne M) = p^k.$$

We will first obtain a bound on the geometric approximation to $N$.
Letting $A_i = \{X_i = 0, X_{i+1} = X_{i+2} = \cdots = X_{i+k} = 1\}$ be the
event a run of a tail followed by $k$ heads in a row appears starting
with flip number $i$, let $W = \min\{i \ge 2 : I_{A_i} = 1\} - 1$ and note that
$N =_d W$.

Next generate $V$ as follows using a new sequence of coin flips
$Y_1, Y_2, \ldots$ defined so that $Y_i = X_i$, $\forall i > k + 1$ and the vector

$$(Y_1, Y_2, \ldots, Y_{k+1}) =_d (X_1, X_2, \ldots, X_{k+1} \mid I_{A_1} = 0)$$

is generated to be independent of all else. Then let
$B_i = \{Y_i = 0, Y_{i+1} = Y_{i+2} = \cdots = Y_{i+k} = 1\}$ be the event a run of tails followed
by $k$ heads in a row appears in this new sequence starting with flip
number $i$. If $I_{A_1} = 0$ then let $V = W$, and otherwise let
$V = \min\{i \ge 2 : I_{B_i} = 1\} - 1$. Note that this gives $V =_d (W - 1 \mid W > 1)$ and

$$\begin{aligned}
P(W \ne V) &\le \sum_{i=2}^{k+1} P(A_1 \cap B_i) \\
&= P(A_1) \sum_{i=2}^{k+1} P(A_i \mid A_1^c) \\
&= P(A_1) \sum_{i=2}^{k+1} P(A_i)/P(A_1^c) \\
&\le k q p^k.
\end{aligned}$$

Thus if $Z \sim$ geometric($q p^k$) then applying the previous theorem gives
$d_{TV}(N, Z) \le k p^k$. The triangle inequality applies to $d_{TV}$ so we get

$$d_{TV}(M - 1, Z) \le d_{TV}(M - 1, N) + d_{TV}(N, Z) \le (k + 1) p^k$$

and

$$P(Z \in A) - (k + 1) p^k \le P(M \in A) \le P(Z \in A) + (k + 1) p^k.$$


2.6 Stein’s Method for the Normal Distribution 
Let $Z \sim N(0, 1)$ be a standard normal random variable. It can be
shown that for smooth functions $f$ we have $E[f'(Z) - Z f(Z)] = 0$,
and this inspires the following Stein equation.


Lemma 2.19 Given $a > 0$ and any value of $z$ let

$$h_{a,z}(x) \equiv h(x) = \begin{cases} 1 & \text{if } x \le z \\ 0 & \text{if } x > z + a \\ (a + z - x)/a & \text{otherwise,} \end{cases}$$

and define the function $f_{a,z}(x) \equiv f(x)$, $-\infty < x < \infty$, so it satisfies

$$f'(x) - x f(x) = h(x) - E[h(Z)].$$

Then $|f'(x) - f'(y)| \le \frac{2}{a} |x - y|$, $\forall x, y$.

Proof. Letting $\phi(x) = e^{-x^2/2} / \sqrt{2\pi}$ be the standard normal density
function, we have the solution

$$f(x) = \frac{E[h(Z) I_{Z \le x}] - E[h(Z)] P(Z \le x)}{\phi(x)},$$

which can be checked by differentiating using $\frac{d}{dx} (\phi(x)^{-1}) = x / \phi(x)$
to get $f'(x) = x f(x) + h(x) - E[h(Z)]$. Then this gives

$$\begin{aligned}
|f''(x)| &= |f(x) + x f'(x) + h'(x)| \\
&= |(1 + x^2) f(x) + x (h(x) - E[h(Z)]) + h'(x)|.
\end{aligned}$$

Since

$$\begin{aligned}
h(x) - E[h(Z)] &= \int_{-\infty}^{\infty} (h(x) - h(s)) \phi(s) \, ds \\
&= \int_{-\infty}^{x} \int_{s}^{x} h'(t) \, dt \, \phi(s) \, ds - \int_{x}^{\infty} \int_{x}^{s} h'(t) \, dt \, \phi(s) \, ds \\
&= \int_{-\infty}^{x} h'(t) P(Z \le t) \, dt - \int_{x}^{\infty} h'(t) P(Z > t) \, dt,
\end{aligned}$$

and a similar argument gives

$$f(x) = -\frac{P(Z > x)}{\phi(x)} \int_{-\infty}^{x} h'(t) P(Z \le t) \, dt - \frac{P(Z \le x)}{\phi(x)} \int_{x}^{\infty} h'(t) P(Z > t) \, dt,$$

we get

$$\begin{aligned}
|f''(x)| &\le |h'(x)| + |(1 + x^2) f(x) + x (h(x) - E[h(Z)])| \\
&\le \frac{1}{a} + \frac{1}{a} \big( -x + (1 + x^2) P(Z > x)/\phi(x) \big) \big( x P(Z \le x) + \phi(x) \big) \\
&\quad + \frac{1}{a} \big( x + (1 + x^2) P(Z \le x)/\phi(x) \big) \big( -x P(Z > x) + \phi(x) \big) \\
&\le 2/a.
\end{aligned}$$

We finally use

$$|f'(x) - f'(y)| \le \int_{\min(x,y)}^{\max(x,y)} |f''(t)| \, dt \le \frac{2}{a} |x - y|$$

to give the lemma. ∎


Theorem 2.20 If $Z \sim N(0, 1)$ and $W = \sum_{i=1}^{n} X_i$ where $X_i$ are independent
mean 0 and $\text{Var}(W) = 1$, then

$$\sup_{z} |P(W \le z) - P(Z \le z)| \le 2 \sqrt{3 \sum_{i=1}^{n} E[|X_i|^3]}.$$


Proof. Given any $a > 0$ and any $z$, define $h, f$ as in the previous
lemma. Then

$$\begin{aligned}
P(W \le z) - P(Z \le z) &\le E h(W) - E h(Z) + E h(Z) - P(Z \le z) \\
&\le |E[h(W)] - E[h(Z)]| + P(z \le Z \le z + a) \\
&\le |E[h(W)] - E[h(Z)]| + \int_{z}^{z+a} \frac{1}{\sqrt{2\pi}} e^{-x^2/2} \, dx \\
&\le |E[h(W)] - E[h(Z)]| + a.
\end{aligned}$$

To finish the proof of the theorem we will show

$$|E[h(W)] - E[h(Z)]| \le \sum_{i=1}^{n} 3 E[|X_i|^3] / a,$$

and then by choosing

$$a = \sqrt{\sum_{i=1}^{n} 3 E[|X_i|^3]}$$

we get

$$P(W \le z) - P(Z \le z) \le 2 \sqrt{3 \sum_{i=1}^{n} E[|X_i|^3]}.$$

Repeating the same argument starting with

$$\begin{aligned}
P(Z \le z) - P(W \le z) &\le P(Z \le z) - E h(Z + a) + E h(Z + a) - E h(W + a) \\
&\le |E[h(W + a)] - E[h(Z + a)]| + P(z \le Z \le z + a),
\end{aligned}$$

the theorem will be proved.

To do this let $W_i = W - X_i$, and let $Y_i$ be a random variable
independent of all else and having the same distribution as $X_i$. Using
$\text{Var}(W) = \sum_{i=1}^{n} E[Y_i^2] = 1$ and $E[X_i f(W_i)] = E[X_i] E[f(W_i)] = 0$
in the second equality below, and the preceding lemma with
$|W - W_i - t| \le |t| + |X_i|$ in the second inequality below, we have

$$\begin{aligned}
|E[h(W)] - E[h(Z)]| &= |E[f'(W) - W f(W)]| \\
&= \Big| \sum_{i=1}^{n} E[Y_i^2 f'(W) - X_i (f(W) - f(W_i))] \Big| \\
&= \Big| \sum_{i=1}^{n} E\Big[ Y_i \int_{0}^{Y_i} (f'(W) - f'(W_i + t)) \, dt \Big] \Big| \\
&\le \sum_{i=1}^{n} E\Big[ Y_i \int_{0}^{Y_i} |f'(W) - f'(W_i + t)| \, dt \Big] \\
&\le \sum_{i=1}^{n} E\Big[ Y_i \int_{0}^{Y_i} \frac{2}{a} (|t| + |X_i|) \, dt \Big],
\end{aligned}$$

and continuing on from this we get

$$\begin{aligned}
&= \sum_{i=1}^{n} E[|X_i|^3]/a + 2 E[X_i^2] E[|X_i|]/a \\
&\le \frac{3}{a} \sum_{i=1}^{n} E[|X_i|^3],
\end{aligned}$$

where in the last line we use $E[|X_i|] E[X_i^2] \le E[|X_i|^3]$ (which follows
since $|X_i|$ and $X_i^2$ are both increasing functions of $|X_i|$ and are, thus,
positively correlated, where the proof of this result is given in the
following lemma). ∎
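A concrete case shows what the bound of Theorem 2.20 delivers. The sketch below assumes a hypothetical example not in the text: $X_i = \pm 1/\sqrt{n}$ with probability 1/2 each, so that $\text{Var}(W) = 1$ and $\sum_i E[|X_i|^3] = n^{-1/2}$. It computes the exact value of $\sup_z |P(W \le z) - P(Z \le z)|$ from the binomial distribution and compares it with the bound.

```python
import math

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def kolmogorov_distance_signs(n):
    """sup_z |P(W <= z) - P(Z <= z)| for W = (sum of n iid +-1 signs)/sqrt(n)."""
    cdf_before = 0.0
    worst = 0.0
    for b in range(n + 1):                      # b = number of +1 signs
        pmf = math.exp(math.lgamma(n + 1) - math.lgamma(b + 1)
                       - math.lgamma(n - b + 1) - n * math.log(2.0))
        atom = (2 * b - n) / math.sqrt(n)       # value of W for this b
        cdf_after = cdf_before + pmf
        phi = normal_cdf(atom)
        # the supremum is approached at a jump, from the left or the right
        worst = max(worst, abs(cdf_before - phi), abs(cdf_after - phi))
        cdf_before = cdf_after
    return worst

def theorem_bound(n):
    """2 * sqrt(3 * sum of E|X_i|^3), with E|X_i|^3 = n^(-3/2) for scaled signs."""
    return 2.0 * math.sqrt(3.0 * n * n ** -1.5)

n = 10_000
distance = kolmogorov_distance_signs(n)
bound = theorem_bound(n)
print(distance, bound)
```

In this example the bound shrinks like $n^{-1/4}$, which is enough to prove convergence even though the computed distance is much smaller.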




Lemma 2.21 If $f(x)$ and $g(x)$ are nondecreasing functions, then for
any random variable $X$

$$E[f(X) g(X)] \ge E[f(X)] E[g(X)].$$


Proof. Let $X_1$ and $X_2$ be independent random variables having the
same distribution as $X$. Because $f(X_1) - f(X_2)$ and $g(X_1) - g(X_2)$
are either both nonnegative or are both nonpositive, their product is
nonnegative. Consequently,

$$E[(f(X_1) - f(X_2))(g(X_1) - g(X_2))] \ge 0$$

or, equivalently,

$$E[f(X_1) g(X_1)] + E[f(X_2) g(X_2)] \ge E[f(X_1) g(X_2)] + E[f(X_2) g(X_1)].$$

But, by independence,

$$E[f(X_1) g(X_2)] = E[f(X_2) g(X_1)] = E[f(X)] E[g(X)]$$

and the result follows. ∎
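The inequality of Lemma 2.21 can be checked exactly on a small discrete example. The distribution and the two nondecreasing functions below are arbitrary choices made for illustration, and exact fractions are used so that no rounding enters the comparison.

```python
from fractions import Fraction

def expectation(h, pmf):
    """E[h(X)] for a discrete X given as a dict value -> probability."""
    return sum(h(x) * p for x, p in pmf.items())

pmf = {x: Fraction(1, 5) for x in (-2, -1, 0, 1, 2)}   # X uniform on five points

def f(x):
    return x ** 3        # nondecreasing

def g(x):
    return min(x, 1)     # nondecreasing

lhs = expectation(lambda x: f(x) * g(x), pmf)     # E[f(X) g(X)]
rhs = expectation(f, pmf) * expectation(g, pmf)   # E[f(X)] E[g(X)]
print(lhs, rhs)
```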


The preceding results yield the following version of the central 
limit theorem as an immediate corollary. 


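Corollary 2.22 If $Z \sim N(0, 1)$ and $Y_1, Y_2, \ldots$ are iid random variables
with $E[Y_i] = \mu$, $\text{Var}(Y_i) = \sigma^2$ and $E[|Y_i|^3] < \infty$, then as $n \to \infty$
we have

$$P\Big( \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \frac{Y_i - \mu}{\sigma} \le z \Big) \to P(Z \le z).$$


Proof. Letting $X_i = \frac{Y_i - \mu}{\sigma \sqrt{n}}$, $i \ge 1$, and $W_n = \sum_{i=1}^{n} X_i$ then $W_n$
satisfies the conditions of Theorem 2.20. Because

$$\sum_{i=1}^{n} E[|X_i|^3] = n E[|X_1|^3] = \frac{n E[|Y_1 - \mu|^3]}{\sigma^3 n^{3/2}} \to 0$$

it follows from Theorem 2.20 that $P(W_n \le x) \to P(Z \le x)$. ∎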
2.7 Exercises


1. If $X \sim$ Poisson($a$) and $Y \sim$ Poisson($b$), with $b > a$, use coupling
to show that $Y \ge_{st} X$.

2. Suppose a particle starts at position 5 on a number line and at
each time period the particle moves one position to the right
with probability $p$ and, if the particle is above position 0, moves
one position to left with probability $1 - p$. Let $X_n(p)$ be the
position of the particle at time $n$ for the given value of $p$. Use
coupling to show that $X_n(a) \ge_{st} X_n(b)$ for any $n$ if $a \ge b$.

3. Let $X, Y$ be indicator variables with $E[X] = a$ and $E[Y] = b$.
(a) Show how to construct a maximal coupling $\hat{X}, \hat{Y}$ for $X$ and
$Y$, and then compute $P(\hat{X} = \hat{Y})$ as a function of $a, b$. (b) Show
how to construct a minimal coupling to minimize $P(\hat{X} = \hat{Y})$.

4. In a room full of $n$ people, let $X$ be the number of people who
share a birthday with at least one other person in the room.
Then let $Y$ be the number of pairs of people in the room having
the same birthday. (a) Compute $E[X]$ and $\text{Var}(X)$ and $E[Y]$
and $\text{Var}(Y)$. (b) Which of the two variables $X$ or $Y$ do you
believe will more closely follow a Poisson distribution? Why?
(c) In a room of 51 people, it turns out there are 3 pairs with
the same birthday and also a triplet (3 people) with the same
birthday. This is a total of 9 people and also 6 pairs. Use a
Poisson approximation to estimate $P(X > 9)$ and $P(Y > 6)$.
Which of these two approximations do you think will be better?
Have we observed a rare event here?

5. Compute a bound on the accuracy of the better approximation
in the previous exercise part (c) using the Stein-Chen method.

6. For discrete $X, Y$ prove $d_{TV}(X, Y) = \frac{1}{2} \sum_x |P(X = x) - P(Y = x)|$.

7. For discrete $X, Y$ show that $P(X \ne Y) \ge d_{TV}(X, Y)$ and show
also that there exists a coupling that yields equality.

8. Compute a bound on the accuracy of a normal approximation
for a Poisson random variable with mean 100.
9. If $X \sim$ Geometric($p$), with $q = 1 - p$, then show that for any
bounded function $f$ with $f(0) = 0$, we have $E[f(X) - q f(X + 1)] = 0$.

10. Suppose $X_1, X_2, \ldots$ are independent mean zero random variables
with $|X_n| < 1$ for all $n$ and $\sum_{i \le n} \text{Var}(X_i)/n \to s < \infty$.
If $S_n = \sum_{i \le n} X_i$ and $Z \sim N(0, s)$, show that $S_n/\sqrt{n} \to_d Z$.

11. Suppose $m$ balls are placed among $n$ urns, with each ball independently
going in to urn $i$ with probability $p_i$. Assume $m$ is
much larger than $n$. Approximate the chance none of the urns
are empty, and give a bound on the error of the approximation.

12. A ring with a circumference of $c$ is cut into $n$ pieces (where
$n$ is large) by cutting at $n$ places chosen uniformly at random
around the ring. Estimate the chance you get $k$ pieces of length
at least $a$, and give a bound on the error of the approximation.

13. Suppose $X_i$, $i = 1, 2, \ldots, 10$ are iid $U(0, 1)$. Give an approximation
for $P(\sum_{i=1}^{10} X_i > 7)$, and give a bound on the error of this
approximation.

14. Suppose $X_i$, $i = 1, 2, \ldots, n$ are independent random variables
with $E[X_i] = 0$, and $\sum_{i=1}^{n} \text{Var}(X_i) = 1$. Let $W = \sum_{i=1}^{n} X_i$
and show that

$$\Big| E|W| - \sqrt{\frac{2}{\pi}} \Big| \le 3 \sum_{i=1}^{n} E|X_i|^3.$$



Chapter 3 


Conditional Expectation and 
Martingales 


3.1 Introduction 


A generalization of a sequence of independent random variables oc- 
curs when we allow the variables of the sequence to be dependent 
on previous variables in the sequence. One example of this type of 
dependence is called a martingale, and its definition formalizes the 
concept of a fair gambling game. A number of results which hold for 
independent random variables also hold, under certain conditions, 
for martingales, and seemingly complex problems can be elegantly 
solved by reframing them in terms of a martingale. 

In section 2 we introduce the notion of conditional expectation, in
section 3 we formally define a martingale, in section 4 we introduce
the concept of stopping times and prove the martingale stopping
theorem, in section 5 we give an approach for finding tail probabilities
for martingales, and in section 6 we introduce supermartingales and
submartingales, and prove the martingale convergence theorem.


3.2 Conditional Expectation 


Let $X$ be such that $E[|X|] < \infty$. In a first course in probability,
$E[X|Y]$, the conditional expectation of $X$ given $Y$, is defined to be
that function of $Y$ that when $Y = y$ is equal to

$$E[X|Y = y] = \begin{cases} \sum_x x P(X = x|Y = y), & \text{if } X, Y \text{ are discrete} \\ \int x f_{X|Y}(x|y)\,dx, & \text{if } X, Y \text{ have joint density } f \end{cases}$$

where
$$f_{X|Y}(x|y) = \frac{f(x, y)}{\int f(x, y)\,dx} = \frac{f(x, y)}{f_Y(y)}$$

The important result, often called the tower property, that
$$E[X] = E[E[X|Y]]$$
is then proven. This result, which is often written as

$$E[X] = \begin{cases} \sum_y E[X|Y = y] P(Y = y), & \text{if } X, Y \text{ are discrete} \\ \int E[X|Y = y] f_Y(y)\,dy, & \text{if } X, Y \text{ are jointly continuous} \end{cases}$$

is then gainfully employed in a variety of different calculations.
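
As a small illustration of the discrete case, the following sketch computes $E[X|Y = y]$ from a joint probability mass function and checks the tower property with exact rational arithmetic. The joint distribution and the helper names `marginal_y` and `cond_exp_given_y` are hypothetical, chosen only for this example.

```python
from fractions import Fraction

# Hypothetical joint pmf of (X, Y); the numbers are illustrative only.
joint = {
    (0, 0): Fraction(1, 8), (1, 0): Fraction(3, 8),
    (0, 1): Fraction(1, 4), (2, 1): Fraction(1, 4),
}

def marginal_y(joint):
    """Return P(Y = y) for every y with positive probability."""
    p_y = {}
    for (_, y), p in joint.items():
        p_y[y] = p_y.get(y, Fraction(0)) + p
    return p_y

def cond_exp_given_y(joint):
    """Return the map y -> E[X | Y = y] = sum_x x P(X = x | Y = y)."""
    p_y = marginal_y(joint)
    numer = {}
    for (x, y), p in joint.items():
        numer[y] = numer.get(y, Fraction(0)) + x * p
    return {y: numer[y] / p_y[y] for y in p_y}

p_y = marginal_y(joint)
h = cond_exp_given_y(joint)

# Left side of the tower property: E[X] computed directly.
e_x = sum(x * p for (x, _), p in joint.items())
# Right side: sum over y of E[X | Y = y] P(Y = y).
tower = sum(h[y] * p_y[y] for y in p_y)

print(e_x, tower)
```

Both quantities are computed independently, so their agreement is a genuine check of the identity for this particular distribution rather than a restatement of it.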

We now show how to give a more general definition of conditional 
expectation, that reduces to the preceding cases when the random 
variables are discrete or continuous. To motivate our definition, sup- 
pose that whether or not A occurs is determined by the value of Y. 
(That is, suppose that $A \in \sigma(Y)$.) Then, using material from our
first course in probability, we see that

$$E[X I_A] = E[E[X I_A|Y]] = E[I_A E[X|Y]]$$

where the final equality holds because, given $Y$, $I_A$ is a constant
random variable.


We are now ready for a general definition of conditional expectation.

Definition For random variables $X, Y$, let $E[X|Y]$, called the conditional
expectation of $X$ given $Y$, denote that function $h(Y)$ having
the property that for any $A \in \sigma(Y)$

$$E[X I_A] = E[h(Y) I_A].$$

By the Radon-Nikodym Theorem of measure theory, a function
$h$ that makes $h(Y)$ a (measurable) random variable and satisfies the
preceding always exists, and, as we show in the following, is unique
in the sense that any two such functions of $Y$ must, with probability
1, be equal. The function $h$ is also referred to as a Radon-Nikodym
derivative.


Proposition 3.1 If $h_1$ and $h_2$ are functions such that
$$E[h_1(Y) I_A] = E[h_2(Y) I_A]$$
for any $A \in \sigma(Y)$, then
$$P(h_1(Y) = h_2(Y)) = 1$$

Proof. Let $A_n = \{h_1(Y) - h_2(Y) > 1/n\}$. Then,

$$\begin{aligned} 0 &= E[h_1(Y) I_{A_n}] - E[h_2(Y) I_{A_n}] \\ &= E[(h_1(Y) - h_2(Y)) I_{A_n}] \\ &\ge \frac{1}{n} P(A_n) \end{aligned}$$

showing that $P(A_n) = 0$. Because the events $A_n$ are increasing in $n$,
this yields

$$0 = \lim_n P(A_n) = P(\lim_n A_n) = P(\cup_n A_n) = P(h_1(Y) > h_2(Y))$$

Similarly, we can show that $0 = P(h_1(Y) < h_2(Y))$, which proves
the result. ■


We now show that the preceding general definition of conditional
expectation reduces to the usual ones when $X, Y$ are either jointly
discrete or continuous.

Proposition 3.2 If $X$ and $Y$ are both discrete, then

$$E[X|Y = y] = \sum_x x P(X = x|Y = y)$$

whereas if they are jointly continuous with joint density $f$, then

$$E[X|Y = y] = \int x f_{X|Y}(x|y)\,dx$$

where $f_{X|Y}(x|y) = \frac{f(x, y)}{\int f(x, y)\,dx}$.

Proof Suppose $X$ and $Y$ are both discrete, and let
$$h(y) = \sum_x x P(X = x|Y = y)$$

For $A \in \sigma(Y)$, define
$$B = \{y : I_A = 1 \text{ when } Y = y\}$$

Because $A \in \sigma(Y)$, it follows that $I_A = I_{\{Y \in B\}}$. Thus,

$$X I_A = \begin{cases} x, & \text{if } X = x, Y \in B \\ 0, & \text{otherwise} \end{cases}$$

and

$$h(Y) I_A = \begin{cases} \sum_x x P(X = x|Y = y), & \text{if } Y = y \in B \\ 0, & \text{otherwise} \end{cases}$$

Thus,

$$\begin{aligned} E[X I_A] &= \sum_x x P(X = x, Y \in B) \\ &= \sum_x x \sum_{y \in B} P(X = x, Y = y) \\ &= \sum_{y \in B} \sum_x x P(X = x|Y = y) P(Y = y) \\ &= E[h(Y) I_A] \end{aligned}$$

The result thus follows by uniqueness. The proof in the continuous
case is similar, and is left as an exercise. ■


For any sigma field $\mathcal{F}$ we can define $E[X|\mathcal{F}]$ to be that random
variable in $\mathcal{F}$ having the property that for all $A \in \mathcal{F}$

$$E[X I_A] = E[E[X|\mathcal{F}] I_A]$$

Intuitively, $E[X|\mathcal{F}]$ represents the conditional expectation of $X$ given
that we know all of the events in $\mathcal{F}$ that occur.


Remark 3.3 It follows from the preceding definition that $E[X|Y] = E[X|\sigma(Y)]$.


For any random variables $X, X_1, \ldots, X_n$, define $E[X|X_1, \ldots, X_n]$ by
$$E[X|X_1, \ldots, X_n] = E[X|\sigma(X_1, \ldots, X_n)]$$

In other words, $E[X|X_1, \ldots, X_n]$ is that function $h(X_1, \ldots, X_n)$ for
which

$$E[X I_A] = E[h(X_1, \ldots, X_n) I_A], \quad \text{for all } A \in \sigma(X_1, \ldots, X_n)$$


We now establish some important properties of conditional ex- 
pectation. 


Proposition 3.4
(a) The Tower Property: For any sigma field $\mathcal{F}$
$$E[X] = E[E[X|\mathcal{F}]]$$
(b) For any $A \in \mathcal{F}$,
$$E[X I_A|\mathcal{F}] = I_A E[X|\mathcal{F}]$$
(c) If $X$ is independent of all $Y \in \mathcal{F}$, then
$$E[X|\mathcal{F}] = E[X]$$
(d) If $E[|X_i|] < \infty$, $i = 1, \ldots, n$, then
$$E\Big[\sum_{i=1}^n X_i \Big| \mathcal{F}\Big] = \sum_{i=1}^n E[X_i|\mathcal{F}]$$
(e) Jensen's Inequality: If $f$ is a convex function, then
$$E[f(X)|\mathcal{F}] \ge f(E[X|\mathcal{F}])$$
provided the expectations exist.


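Proof Recall that $E[X|\mathcal{F}]$ is the unique random variable in $\mathcal{F}$ such
that

$$E[X I_A] = E[E[X|\mathcal{F}] I_A] \quad \text{if } A \in \mathcal{F}$$

Letting $A = \Omega$, $I_A = I_\Omega = 1$, and part (a) follows.
To prove (b), fix $A \in \mathcal{F}$ and let $X^* = X I_A$. Because $E[X^*|\mathcal{F}]$ is
the unique random variable in $\mathcal{F}$ such that $E[X^* I_{A'}] = E[E[X^*|\mathcal{F}] I_{A'}]$ for
all $A' \in \mathcal{F}$, to show that $E[X^*|\mathcal{F}] = I_A E[X|\mathcal{F}]$, it suffices to show
that for $A' \in \mathcal{F}$

$$E[X^* I_{A'}] = E[I_A E[X|\mathcal{F}] I_{A'}]$$
That is, it suffices to show that
$$E[X I_A I_{A'}] = E[I_A E[X|\mathcal{F}] I_{A'}]$$
or, equivalently, that
$$E[X I_{AA'}] = E[E[X|\mathcal{F}] I_{AA'}]$$

which, since $AA' \in \mathcal{F}$, follows by the definition of conditional
expectation.

Part (c) will follow if we can show that, for $A \in \mathcal{F}$
$$E[X I_A] = E[E[X] I_A]$$
which follows because
$$\begin{aligned} E[X I_A] &= E[X] E[I_A] \quad \text{by independence} \\ &= E[E[X] I_A] \end{aligned}$$
We will leave the proofs of (d) and (e) as exercises. ■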


Remark 3.5 It can be shown that E[X|F] satisfies all the properties 
of ordinary expectations, except that all probabilities are now com- 
puted conditional on knowing which events in F have occurred. For 
instance, applying the tower property to E[X|F] yields that 


$$E[X|\mathcal{F}] = E[E[X|\mathcal{F} \cup \mathcal{G}]|\mathcal{F}].$$
While the conditional expectation of $X$ given the sigma field $\mathcal{F}$
is defined to be that function satisfying


$$E[X I_A] = E[E[X|\mathcal{F}] I_A] \quad \text{for all } A \in \mathcal{F}$$
it can be shown, using the dominated convergence theorem, that
$$E[XW] = E[E[X|\mathcal{F}] W] \quad \text{for all } W \in \mathcal{F} \qquad (3.2)$$
The following proposition is quite useful.

Proposition 3.6 If $W \in \mathcal{F}$, then
$$E[XW|\mathcal{F}] = W E[X|\mathcal{F}]$$

Before giving a proof, let us note that the result is quite intuitive.
Because $W \in \mathcal{F}$, it follows that conditional on knowing which events
of $\mathcal{F}$ occur (that is, conditional on $\mathcal{F}$) the random variable $W$
becomes a constant, and the expected value of a constant times a
random variable is just the constant times the expected value of the
random variable. Next we formally prove the result.

Proof. Let
$$\begin{aligned} Y &= E[XW|\mathcal{F}] - W E[X|\mathcal{F}] \\ &= (X - E[X|\mathcal{F}])W - (XW - E[XW|\mathcal{F}]) \end{aligned}$$
and note that $Y \in \mathcal{F}$. Now, for $A \in \mathcal{F}$,
$$E[Y I_A] = E[(X - E[X|\mathcal{F}]) W I_A] - E[(XW - E[XW|\mathcal{F}]) I_A]$$
However, because $W I_A \in \mathcal{F}$, we use Equation (3.2) to get
$$E[(X - E[X|\mathcal{F}]) W I_A] = E[X W I_A] - E[E[X|\mathcal{F}] W I_A] = 0,$$
and by the definition of conditional expectation we have
$$E[(XW - E[XW|\mathcal{F}]) I_A] = E[XW I_A] - E[E[XW|\mathcal{F}] I_A] = 0.$$
Thus, we see that for $A \in \mathcal{F}$,
$$E[Y I_A] = 0$$

Setting first $A = \{Y > 0\}$, and then $A = \{Y < 0\}$ (which are both
in $\mathcal{F}$ because $Y \in \mathcal{F}$) shows that

$$P(Y > 0) = P(Y < 0) = 0$$

Hence, $Y = 0$, which proves the result. ■


3.3 Martingales


We say that the sequence of sigma fields $\mathcal{F}_1, \mathcal{F}_2, \ldots$ is a filtration if
$\mathcal{F}_1 \subseteq \mathcal{F}_2 \subseteq \cdots$. We say a sequence of random variables $X_1, X_2, \ldots$ is
adapted to $\mathcal{F}_n$ if $X_n \in \mathcal{F}_n$ for all $n$.

To obtain a feel for these definitions it is useful to think of $n$
as representing time, with information being gathered as time
progresses. With this interpretation, the sigma field $\mathcal{F}_n$ represents all
events that are determined by what occurs up to time $n$, and thus
contains $\mathcal{F}_{n-1}$. The sequence $X_n, n \ge 1$, is adapted to the filtration
$\mathcal{F}_n, n \ge 1$, when the value of $X_n$ is determined by what occurs by
time $n$.


Definition 3.7 $Z_n$ is a martingale for filtration $\mathcal{F}_n$ if
1. $E[|Z_n|] < \infty$
2. $Z_n$ is adapted to $\mathcal{F}_n$
3. $E[Z_{n+1}|\mathcal{F}_n] = Z_n$


A bet is said to be fair if its expected gain is equal to 0. A
martingale can be thought of as being a generalized version of a fair
game. For consider a gambling casino in which bets can be made
concerning games played in sequence. Let $\mathcal{F}_n$ consist of all events
whose outcome is determined by the results of the first $n$ games. Let
$Z_n$ denote the fortune of a specified gambler after the first $n$ games.
Then the martingale condition states that no matter what the results
of the first $n$ games, the gambler's expected fortune after game $n + 1$
is exactly what it was before that game. That is, no matter what has
previously occurred, the gambler's expected winnings in any game are
equal to 0.


It follows upon taking expectations of both sides of the final
martingale condition that

$$E[Z_{n+1}] = E[Z_n]$$

implying that
$$E[Z_n] = E[Z_1]$$
We call $E[Z_1]$ the mean of the martingale.

Another useful martingale result is that

$$\begin{aligned} E[Z_{n+2}|\mathcal{F}_n] &= E[E[Z_{n+2}|\mathcal{F}_{n+1} \cup \mathcal{F}_n]|\mathcal{F}_n] \\ &= E[E[Z_{n+2}|\mathcal{F}_{n+1}]|\mathcal{F}_n] \\ &= E[Z_{n+1}|\mathcal{F}_n] \\ &= Z_n \end{aligned}$$

Repeating this argument yields that
$$E[Z_{n+k}|\mathcal{F}_n] = Z_n, \quad k \ge 1$$


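Definition 3.8 We say that $Z_n, n \ge 1$, is a martingale (without
specifying a filtration) if

1. $E[|Z_n|] < \infty$
2. $E[Z_{n+1}|Z_1, \ldots, Z_n] = Z_n$


If $\{Z_n\}$ is a martingale for the filtration $\mathcal{F}_n$, then it is a
martingale. This follows from

$$\begin{aligned} E[Z_{n+1}|Z_1, \ldots, Z_n] &= E[E[Z_{n+1}|Z_1, \ldots, Z_n, \mathcal{F}_n]|Z_1, \ldots, Z_n] \\ &= E[E[Z_{n+1}|\mathcal{F}_n]|Z_1, \ldots, Z_n] \\ &= E[Z_n|Z_1, \ldots, Z_n] \\ &= Z_n \end{aligned}$$

where the second equality followed because $Z_i \in \mathcal{F}_n$, for all
$i = 1, \ldots, n$.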


We now give some examples of martingales. 


Example 3.9 If $X_i, i \ge 1$, are independent zero mean random
variables, then $Z_n = \sum_{i=1}^n X_i$, $n \ge 1$, is a martingale with respect to
the filtration $\sigma(X_1, \ldots, X_n), n \ge 1$. This follows because

$$\begin{aligned} E[Z_{n+1}|X_1, \ldots, X_n] &= E[Z_n + X_{n+1}|X_1, \ldots, X_n] \\ &= E[Z_n|X_1, \ldots, X_n] + E[X_{n+1}|X_1, \ldots, X_n] \\ &= Z_n + E[X_{n+1}] \quad \text{by the independence of the } X_i \\ &= Z_n \end{aligned}$$
■

Example 3.10 Let $X_i, i \ge 1$, be independent and identically
distributed with mean zero and variance $\sigma^2$. Let $S_n = \sum_{i=1}^n X_i$, and
define

$$Z_n = S_n^2 - n\sigma^2, \quad n \ge 1$$

Then $Z_n, n \ge 1$, is a martingale for the filtration $\sigma(X_1, \ldots, X_n)$. To
verify this claim, note that

$$\begin{aligned} E[S_{n+1}^2|X_1, \ldots, X_n] &= E[(S_n + X_{n+1})^2|X_1, \ldots, X_n] \\ &= E[S_n^2|X_1, \ldots, X_n] + E[2 S_n X_{n+1}|X_1, \ldots, X_n] + E[X_{n+1}^2|X_1, \ldots, X_n] \\ &= S_n^2 + 2 S_n E[X_{n+1}|X_1, \ldots, X_n] + E[X_{n+1}^2] \\ &= S_n^2 + 2 S_n E[X_{n+1}] + \sigma^2 \\ &= S_n^2 + \sigma^2 \end{aligned}$$

Subtracting $(n + 1)\sigma^2$ from both sides yields
$$E[Z_{n+1}|X_1, \ldots, X_n] = Z_n$$
■
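
The martingale property of $S_n^2 - n\sigma^2$ can be checked exhaustively for a small discrete step distribution. The sketch below is only an illustration: the two-point step distribution and the function names are assumptions made for this example, and the check covers finitely many paths rather than proving anything in general.

```python
from fractions import Fraction
from itertools import product

# Hypothetical step distribution with mean zero: P(X=-1)=2/3, P(X=2)=1/3.
steps = {-1: Fraction(2, 3), 2: Fraction(1, 3)}
mean = sum(x * p for x, p in steps.items())
sigma2 = sum(x * x * p for x, p in steps.items())  # variance, since the mean is 0

def z(path):
    """Z_n = S_n^2 - n * sigma^2 for the observed path (x_1, ..., x_n)."""
    s = sum(path)
    return s * s - len(path) * sigma2

def conditional_next(path):
    """E[Z_{n+1} | X_1..X_n = path], averaging over the next independent step."""
    return sum(p * z(path + (x,)) for x, p in steps.items())

def is_martingale(max_n):
    """Check E[Z_{n+1} | path] == Z_n for every path of length 1..max_n."""
    for n in range(1, max_n + 1):
        for path in product(steps, repeat=n):
            if conditional_next(path) != z(path):
                return False
    return True

print(is_martingale(5))
```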


Our next example introduces an important type of martingale, 
known as a Doob martingale. 


Example 3.11 Let $Y$ be an arbitrary random variable with $E[|Y|] < \infty$,
let $\mathcal{F}_n, n \ge 1$, be a filtration, and define

$$Z_n = E[Y|\mathcal{F}_n]$$

We claim that $Z_n, n \ge 1$, is a martingale with respect to the filtration
$\mathcal{F}_n, n \ge 1$. To verify this, we must first show that $E[|Z_n|] < \infty$, which
is accomplished as follows.

$$\begin{aligned} E[|Z_n|] &= E[|E[Y|\mathcal{F}_n]|] \\ &\le E[E[|Y| \,|\, \mathcal{F}_n]] \\ &= E[|Y|] \\ &< \infty \end{aligned}$$

where the first inequality uses that the function $f(x) = |x|$ is convex,
and thus from Jensen's inequality

$$E[|Y| \,|\, \mathcal{F}_n] \ge |E[Y|\mathcal{F}_n]|$$

while the final equality used the tower property. The verification is
now completed as follows.
$$\begin{aligned} E[Z_{n+1}|\mathcal{F}_n] &= E[E[Y|\mathcal{F}_{n+1}]|\mathcal{F}_n] \\ &= E[Y|\mathcal{F}_n] \quad \text{by the tower property} \\ &= Z_n \end{aligned}$$

The martingale $Z_n, n \ge 1$, is called a Doob martingale. ■


Example 3.12 Our final example generalizes the result that the
successive partial sums of independent zero mean random variables
constitute a martingale. For any random variables $X_i, i \ge 1$, the
random variables $X_i - E[X_i|X_1, \ldots, X_{i-1}]$, $i \ge 1$, have mean 0. Even
though they need not be independent, their partial sums constitute
a martingale. That is, we claim that

$$Z_n = \sum_{i=1}^n (X_i - E[X_i|X_1, \ldots, X_{i-1}]), \quad n \ge 1$$

is, provided that $E[|Z_n|] < \infty$, a martingale with respect to the
filtration $\sigma(X_1, \ldots, X_n), n \ge 1$. To verify this claim, note that

$$Z_{n+1} = Z_n + X_{n+1} - E[X_{n+1}|X_1, \ldots, X_n]$$
Thus,

$$\begin{aligned} E[Z_{n+1}|X_1, \ldots, X_n] &= Z_n + E[X_{n+1}|X_1, \ldots, X_n] - E[X_{n+1}|X_1, \ldots, X_n] \\ &= Z_n \end{aligned}$$
■


3.4 The Martingale Stopping Theorem 


The positive integer valued, possibly infinite, random variable $N$ is
said to be a random time for the filtration $\mathcal{F}_n$ if $\{N = n\} \in \mathcal{F}_n$ or,
equivalently, if $\{N \ge n\} \in \mathcal{F}_{n-1}$, for all $n$. If $P(N < \infty) = 1$, then
the random time is said to be a stopping time for the filtration.
Thinking once again in terms of information being amassed over
time, with $\mathcal{F}_n$ being the cumulative information as to all events that
have occurred by time $n$, the random variable $N$ will be a stopping
time for this filtration if the decision whether to stop at time $n$ (so
that $N = n$) depends only on what has occurred by time $n$. (That is,
the decision to stop at time $n$ is not allowed to depend on the results
of future events.) It should be noted, however, that the decision
to stop at time $n$ need not be independent of future events, only
that it must be conditionally independent of future events given all
information up to the present.


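Lemma 3.13 Let $Z_n, n \ge 1$, be a martingale for the filtration $\mathcal{F}_n$.
If $N$ is a random time for this filtration, then the process
$\bar{Z}_n = Z_{\min(N, n)}$, $n \ge 1$, called the stopped process, is also a martingale for
the filtration $\mathcal{F}_n$.


Proof Start with the identity

$$\bar{Z}_n = \bar{Z}_{n-1} + I_{\{N \ge n\}}(Z_n - Z_{n-1})$$

To verify the preceding, consider two cases:

(1) $N \ge n$: Here, $\bar{Z}_n = Z_n$, $\bar{Z}_{n-1} = Z_{n-1}$, $I_{\{N \ge n\}} = 1$, and the
preceding is true

(2) $N < n$: Here, $\bar{Z}_n = \bar{Z}_{n-1} = Z_N$, $I_{\{N \ge n\}} = 0$, and the
preceding is true

Hence,

$$\begin{aligned} E[\bar{Z}_n|\mathcal{F}_{n-1}] &= E[\bar{Z}_{n-1}|\mathcal{F}_{n-1}] + E[I_{\{N \ge n\}}(Z_n - Z_{n-1})|\mathcal{F}_{n-1}] \\ &= \bar{Z}_{n-1} + I_{\{N \ge n\}} E[Z_n - Z_{n-1}|\mathcal{F}_{n-1}] \\ &= \bar{Z}_{n-1} \end{aligned}$$
■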


Theorem 3.14 The Martingale Stopping Theorem Let $Z_n$, $n \ge 1$,
be a martingale for the filtration $\mathcal{F}_n$, and suppose that $N$ is a stopping
time for this filtration. Then

$$E[Z_N] = E[Z_1]$$
if any of the following three sufficient conditions hold.
(1) $\bar{Z}_n$ are uniformly bounded;
(2) $N$ is bounded;
(3) $E[N] < \infty$, and there exists $M < \infty$ such that

$$E[|Z_{n+1} - Z_n| \,|\, \mathcal{F}_n] < M$$

Proof. Because the stopped process is also a martingale,
$$E[\bar{Z}_n] = E[\bar{Z}_1] = E[Z_1].$$

Because $P(N < \infty) = 1$, it follows that $\bar{Z}_n = Z_N$ for sufficiently
large $n$, implying that

$$\lim_{n \to \infty} \bar{Z}_n = Z_N.$$

Part (1) follows from the bounded convergence theorem, and part
(2) with the bound $N \le n$ follows from the dominated convergence
theorem using the bound $|\bar{Z}_j| \le \sum_{i=1}^n |Z_i|$.
To prove (3), note that with $Z_0 = 0$ (so that $\bar{Z}_0 = 0$)

$$\begin{aligned} |\bar{Z}_n| &= \Big|\sum_{i=1}^n (\bar{Z}_i - \bar{Z}_{i-1})\Big| \\ &\le \sum_{i=1}^\infty |\bar{Z}_i - \bar{Z}_{i-1}| \\ &= \sum_{i=1}^\infty I_{\{N \ge i\}} |Z_i - Z_{i-1}|, \end{aligned}$$

and the result now follows from the dominated convergence theorem
because

$$\begin{aligned} E\Big[\sum_{i=1}^\infty I_{\{N \ge i\}} |Z_i - Z_{i-1}|\Big] &= \sum_{i=1}^\infty E[I_{\{N \ge i\}} |Z_i - Z_{i-1}|] \\ &= \sum_{i=1}^\infty E[E[I_{\{N \ge i\}} |Z_i - Z_{i-1}| \,|\, \mathcal{F}_{i-1}]] \\ &= \sum_{i=1}^\infty E[I_{\{N \ge i\}} E[|Z_i - Z_{i-1}| \,|\, \mathcal{F}_{i-1}]] \\ &\le M \sum_{i=1}^\infty P(N \ge i) = M E[N] < \infty \end{aligned}$$
■

A corollary of the martingale stopping theorem is Wald's equation.


Corollary 3.15 Wald's Equation If $X_1, X_2, \ldots$ are independent and
identically distributed with finite mean $\mu = E[X_i]$, and if $N$ is a
stopping time for the filtration $\mathcal{F}_n = \sigma(X_1, \ldots, X_n), n \ge 1$, such that
$E[N] < \infty$, then

$$E\Big[\sum_{i=1}^N X_i\Big] = \mu E[N]$$

Proof. $Z_n = \sum_{i=1}^n (X_i - \mu), n \ge 1$, being the successive partial
sums of independent zero mean random variables, is a martingale
with mean 0. Hence, assuming the martingale stopping theorem can
be applied we have that

$$\begin{aligned} 0 &= E[Z_N] \\ &= E\Big[\sum_{i=1}^N (X_i - \mu)\Big] \\ &= E\Big[\sum_{i=1}^N X_i - N\mu\Big] \\ &= E\Big[\sum_{i=1}^N X_i\Big] - E[N\mu] \end{aligned}$$

To complete the proof, we verify sufficient condition (3) of the
martingale stopping theorem.

$$\begin{aligned} E[|Z_{n+1} - Z_n| \,|\, \mathcal{F}_n] &= E[|X_{n+1} - \mu| \,|\, \mathcal{F}_n] \\ &= E[|X_{n+1} - \mu|] \quad \text{by independence} \\ &\le E[|X_1|] + |\mu| \end{aligned}$$

Thus, condition (3) is verified and the result proved. ■
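
Wald's equation can be checked exactly on a small case. The sketch below assumes a hypothetical stopping rule that is not taken from the text: roll a fair die until the running total reaches at least `target`. It computes $E[N]$ and $E[\sum_{i=1}^N X_i]$ separately by backward recursion over the running total, so that the two sides of Wald's equation can be compared.

```python
from fractions import Fraction

def wald_check(target=10, sides=6):
    """Return (mu, E[N], E[S_N]) for a fair die rolled until the running sum >= target."""
    p = Fraction(1, sides)
    faces = range(1, sides + 1)
    mu = sum(p * k for k in faces)

    rolls = {}  # rolls[s]: expected number of further rolls from running sum s
    final = {}  # final[s]: expected final sum from running sum s
    for s in range(target - 1, -1, -1):
        rolls[s] = 1 + sum(p * rolls[s + k] for k in faces if s + k < target)
        final[s] = sum(p * (final[s + k] if s + k < target else s + k) for k in faces)
    return mu, rolls[0], final[0]

mu, expected_n, expected_sum = wald_check()
print(mu, expected_n, expected_sum, mu * expected_n)
```

The stopping rule depends only on the rolls seen so far, and $N$ is at most `target`, so the hypotheses of the corollary hold for this example.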


Example 3.16 Suppose independent and identically distributed
discrete random variables $X_1, X_2, \ldots$ are observed in sequence. With
$P(X_i = j) = p_j$, what is the expected number of random variables
that must be observed until the subsequence 0, 1, 2, 0, 1 occurs? What
is the variance?

Solution Consider a fair gambling casino, that is, one in which the
expected casino winning for every bet is 0. Note that if a gambler
bets her entire fortune of $a$ that the next outcome is $j$, then her
fortune after the bet will either be 0 with probability $1 - p_j$, or $a/p_j$
with probability $p_j$. Now, imagine a sequence of gamblers betting
at this casino. Each gambler starts with an initial fortune 1 and
stops playing if his or her fortune ever becomes 0. Gambler $i$ bets
1 that $X_i = 0$; if she wins, she bets her entire fortune (of $1/p_0$)
that $X_{i+1} = 1$; if she wins that bet, she bets her entire fortune
that $X_{i+2} = 2$; if she wins that bet, she bets her entire fortune
that $X_{i+3} = 0$; if she wins that bet, she bets her entire fortune that
$X_{i+4} = 1$; if she wins that bet, she quits with a final fortune of
$(p_0^2 p_1^2 p_2)^{-1}$.

Let $Z_n$ denote the casino's winnings after the data value $X_n$ is
observed; because it is a fair casino, $Z_n, n \ge 1$, is a martingale with
mean 0 with respect to the filtration $\sigma(X_1, \ldots, X_n), n \ge 1$. Let $N$
denote the number of random variables that need be observed until
the pattern 0, 1, 2, 0, 1 appears, so $(X_{N-4}, \ldots, X_N) = (0, 1, 2, 0, 1)$.
As it is easy to verify that $N$ is a stopping time for the filtration,
and that condition (3) of the martingale stopping theorem is satisfied
when $M = 4/(p_0^2 p_1^2 p_2)$, it follows that $E[Z_N] = 0$. However, after
$X_N$ has been observed, each of the gamblers $1, \ldots, N - 5$ would have
lost 1; gambler $N - 4$ would have won $(p_0^2 p_1^2 p_2)^{-1} - 1$; gamblers $N - 3$
and $N - 2$ would each have lost 1; gambler $N - 1$ would have won
$(p_0 p_1)^{-1} - 1$; gambler $N$ would have lost 1. Therefore,

$$Z_N = N - (p_0^2 p_1^2 p_2)^{-1} - (p_0 p_1)^{-1}$$
Using that $E[Z_N] = 0$ yields the result

$$E[N] = (p_0^2 p_1^2 p_2)^{-1} + (p_0 p_1)^{-1}$$

In the same manner we can compute the expected time until any
specified pattern occurs in independent and identically distributed
generated random data. For instance, when making independent flips
of a coin that comes up heads with probability $p$, the mean number
of flips until the pattern HHTTHH appears is $p^{-4} q^{-2} + p^{-2} + p^{-1}$,
where $q = 1 - p$.
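
The gambling argument amounts to summing $1/(\text{probability of the first } k \text{ symbols})$ over every $k$ for which the length-$k$ prefix of the pattern equals its length-$k$ suffix. The sketch below implements that rule and, as an independent check, also solves the first-step equations of a Markov chain that tracks how much of the pattern is currently matched. The function names and the numerical probabilities used in the test are assumptions for illustration only.

```python
from fractions import Fraction

def expected_wait_by_overlaps(pattern, probs):
    """Sum 1/prod(p) over every k whose length-k prefix equals the length-k suffix."""
    pattern = tuple(pattern)
    total = Fraction(0)
    for k in range(1, len(pattern) + 1):
        if pattern[:k] == pattern[-k:]:
            prod = Fraction(1)
            for sym in pattern[:k]:
                prod *= probs[sym]
            total += 1 / prod
    return total

def expected_wait_by_chain(pattern, probs):
    """Solve e_j = 1 + sum_s p_s e_next(j, s), where state j = matched prefix length."""
    pattern = tuple(pattern)
    size = len(pattern)

    def next_state(j, sym):
        seen = pattern[:j] + (sym,)
        for k in range(min(size, len(seen)), 0, -1):
            if seen[-k:] == pattern[:k]:
                return k
        return 0

    # Augmented matrix for (I - P) e = 1 over the transient states 0..size-1.
    a = [[Fraction(0)] * (size + 1) for _ in range(size)]
    for j in range(size):
        a[j][j] += 1
        a[j][size] = Fraction(1)
        for sym, p in probs.items():
            k = next_state(j, sym)
            if k < size:  # state `size` means the pattern is complete
                a[j][k] -= p

    # Gauss-Jordan elimination with exact fractions.
    for col in range(size):
        piv = next(r for r in range(col, size) if a[r][col] != 0)
        a[col], a[piv] = a[piv], a[col]
        pivot = a[col][col]
        a[col] = [v / pivot for v in a[col]]
        for r in range(size):
            if r != col and a[r][col] != 0:
                factor = a[r][col]
                a[r] = [vr - factor * vc for vr, vc in zip(a[r], a[col])]
    return a[0][size]

coin = {"H": Fraction(1, 2), "T": Fraction(1, 2)}
print(expected_wait_by_overlaps("HHTTHH", coin), expected_wait_by_chain("HHTTHH", coin))
```

For a fair coin the formula $p^{-4} q^{-2} + p^{-2} + p^{-1}$ evaluates to $64 + 4 + 2 = 70$.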

To determine $\text{Var}(N)$, suppose now that gambler $i$ starts with an
initial fortune $i$ and bets that amount that $X_i = 0$; if she wins, she
bets her entire fortune that $X_{i+1} = 1$; if she wins that bet, she bets
her entire fortune that $X_{i+2} = 2$; if she wins that bet, she bets her
entire fortune that $X_{i+3} = 0$; if she wins that bet, she bets her entire
fortune that $X_{i+4} = 1$; if she wins that bet, she quits with a final
fortune of $i/(p_0^2 p_1^2 p_2)$. The casino's winnings at time $N$ are thus

$$\begin{aligned} Z_N &= 1 + 2 + \cdots + N - \frac{N - 4}{p_0^2 p_1^2 p_2} - \frac{N - 1}{p_0 p_1} \\ &= \frac{N(N + 1)}{2} - \frac{N - 4}{p_0^2 p_1^2 p_2} - \frac{N - 1}{p_0 p_1} \end{aligned}$$

Assuming that the martingale stopping theorem holds (although
none of the three sufficient conditions hold, the stopping theorem
can still be shown to be valid for this martingale), we obtain upon
taking expectations that

$$E[N^2] + E[N] = 2\,\frac{E[N] - 4}{p_0^2 p_1^2 p_2} + 2\,\frac{E[N] - 1}{p_0 p_1}$$

Using the previously obtained value of $E[N]$, the preceding can now
be solved for $E[N^2]$, to obtain $\text{Var}(N) = E[N^2] - (E[N])^2$. ■


Example 3.17 The cards from a shuffled deck of 26 red and 26 black 
cards are to be turned over one at a time. At any time a player 
can say “next”, and is a winner if the next card is red and is a 
loser otherwise. A player who has not yet said “next” when only a 
single card remains, is a winner if the final card is red, and is a loser 
otherwise. What is a good strategy for the player? 


Solution Every strategy has probability 1/2 of resulting in a win.
To see this, let $R_n$ denote the number of red cards remaining in the
deck after $n$ cards have been shown. Then

$$E[R_{n+1}|R_1, \ldots, R_n] = R_n - \frac{R_n}{52 - n} = R_n \frac{51 - n}{52 - n}$$

Hence, $\frac{R_n}{52 - n}$, $n \ge 0$, is a martingale. Because $R_0/52 = 1/2$, this
martingale has mean 1/2. Now, consider any strategy, and let $N$
denote the number of cards that are turned over before "next" is
said. Because $N$ is a bounded stopping time, it follows from the
martingale stopping theorem that

$$E\Big[\frac{R_N}{52 - N}\Big] = 1/2$$

Hence, with $I = I_{\{\text{win}\}}$

$$E[I] = E[E[I|R_N]] = E\Big[\frac{R_N}{52 - N}\Big] = 1/2$$
■

Our next example involves the matching problem.

Example 3.18 Each member of a group of n individuals throws his 
or her hat in a pile. The hats are then mixed together, and each 
person randomly selects a hat in such a manner that each of the 
n! possible selections of the n individuals are equally likely. Any 
individual who selects his or her own hat departs, and that is the 
end of round one. If any individuals remain, then each throws the 
hat they have in a pile, and each one then randomly chooses a hat. 
Those selecting their own hats leave, and that ends round two. Find 
$E[N]$, where $N$ is the number of rounds until everyone has departed.


Solution Let $X_i, i \ge 1$, denote the number of matches on round $i$
for $i = 1, \ldots, N$, and let it equal 1 for $i > N$. To solve this example,
we will use the zero mean martingale $Z_k, k \ge 1$, defined by

$$\begin{aligned} Z_k &= \sum_{i=1}^k (X_i - E[X_i|X_1, \ldots, X_{i-1}]) \\ &= \sum_{i=1}^k (X_i - 1) \end{aligned}$$

where the final equality follows because, for any number of remaining
individuals, the expected number of matches in a round is 1 (which
is seen by writing this as the sum of indicator variables for the events
that each remaining person has a match). Because

$$N = \min\Big\{k : \sum_{i=1}^k X_i = n\Big\}$$

is a stopping time for this martingale, we obtain from the martingale
stopping theorem that

$$0 = E[Z_N] = E\Big[\sum_{i=1}^N X_i - N\Big] = n - E[N]$$

and so $E[N] = n$. ■
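
The result $E[N] = n$ can be cross-checked without martingales by conditioning on the number of matches in the first round. The sketch below assumes the standard fact that a uniformly random permutation of $m$ items has exactly $k$ fixed points with probability $\binom{m}{k} D_{m-k}/m!$, where $D_j$ is the number of derangements of $j$ items; the function names are hypothetical.

```python
from fractions import Fraction
from math import comb, factorial

def derangements(k):
    """Number of permutations of k items with no fixed point."""
    d = [1, 0]
    for i in range(2, k + 1):
        d.append((i - 1) * (d[i - 1] + d[i - 2]))
    return d[k]

def expected_rounds(n):
    """Exact E[N] for the matching rounds problem with n individuals."""
    e = [Fraction(0)]  # e[0] = 0: nobody is left
    for size in range(1, n + 1):
        # prob[k] = P(exactly k of the `size` remaining people pick their own hat)
        prob = [Fraction(comb(size, k) * derangements(size - k), factorial(size))
                for k in range(size + 1)]
        # e[size] = 1 + prob[0] * e[size] + sum_{k >= 1} prob[k] * e[size - k]
        rest = sum(prob[k] * e[size - k] for k in range(1, size + 1))
        e.append((1 + rest) / (1 - prob[0]))
    return e[n]

print([expected_rounds(n) for n in range(1, 7)])
```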




Example 3.19 If $X_1, X_2, \ldots$ is a sequence of independent and
identically distributed random variables, $P(X_i = 0) \ne 1$, then the process

$$S_n = \sum_{i=1}^n X_i$$

is said to be a random walk. For given positive constants $a, b$, let $p$
denote the probability that the random walk becomes as large as
$a$ before it becomes as small as $-b$.

We now show how to use martingale theory to approximate $p$. In
the case where we have a bound $|X_i| < c$ it can be shown there will
be a value $\theta \ne 0$ such that

$$E[e^{\theta X}] = 1.$$

Then because
$$Z_n = e^{\theta S_n} = \prod_{i=1}^n e^{\theta X_i}$$
is the product of independent random variables with mean 1, it
follows that $Z_n, n \ge 1$, is a martingale having mean 1. Let

$$N = \min(n : S_n \ge a \text{ or } S_n \le -b).$$

Condition (3) of the martingale stopping theorem can be shown to
hold, implying that
$$E[e^{\theta S_N}] = 1$$
Thus,
$$1 = E[e^{\theta S_N}|S_N \ge a]\,p + E[e^{\theta S_N}|S_N \le -b]\,(1 - p)$$
Now, if $\theta > 0$, then

$$e^{\theta a} \le E[e^{\theta S_N}|S_N \ge a] \le e^{\theta(a + c)}$$

and
$$e^{-\theta(b + c)} \le E[e^{\theta S_N}|S_N \le -b] \le e^{-\theta b}$$
yielding the bounds

$$\frac{1 - e^{-\theta b}}{e^{\theta(a + c)} - e^{-\theta b}} \le p \le \frac{1 - e^{-\theta(b + c)}}{e^{\theta a} - e^{-\theta(b + c)}},$$

and motivating the approximation

$$p \approx \frac{1 - e^{-\theta b}}{e^{\theta a} - e^{-\theta b}}.$$
We leave it as an exercise to obtain bounds on $p$ when $\theta < 0$. ■
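
For a concrete feel for the approximation, the sketch below treats a simple random walk with steps $\pm 1$, a special case chosen for illustration because the walk cannot overshoot integer barriers. It finds $\theta$ by bisection, evaluates the approximation and the bounds (taking $c = 1$, which assumes the bounds remain valid when $|X_i| \le c$), and compares with the classical gambler's ruin probability. All names and numerical values are assumptions.

```python
import math

def mgf_minus_one(theta, dist):
    """E[exp(theta X)] - 1 for a discrete distribution {value: probability}."""
    return sum(p * math.exp(theta * x) for x, p in dist.items()) - 1.0

def solve_theta(dist, lo, hi, iters=200):
    """Bisection for a root of E[exp(theta X)] = 1 inside the bracket [lo, hi]."""
    f_lo = mgf_minus_one(lo, dist)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        f_mid = mgf_minus_one(mid, dist)
        if (f_mid < 0) == (f_lo < 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def approx_p(theta, a, b):
    return (1 - math.exp(-theta * b)) / (math.exp(theta * a) - math.exp(-theta * b))

def bounds_p(theta, a, b, c):
    lower = (1 - math.exp(-theta * b)) / (math.exp(theta * (a + c)) - math.exp(-theta * b))
    upper = (1 - math.exp(-theta * (b + c))) / (math.exp(theta * a) - math.exp(-theta * (b + c)))
    return lower, upper

def exact_p_simple_walk(up, a, b):
    """Gambler's ruin: P(reach a before -b) for +/-1 steps with P(+1) = up != 1/2."""
    r = (1 - up) / up
    return sum(r ** j for j in range(b)) / sum(r ** j for j in range(a + b))

dist = {1: 0.4, -1: 0.6}             # negative drift, so the nonzero root is positive
theta = solve_theta(dist, 0.01, 5.0)  # bracket excludes the trivial root theta = 0
print(theta, approx_p(theta, 3, 4), exact_p_simple_walk(0.4, 3, 4), bounds_p(theta, 3, 4, 1))
```

In this special case $S_N$ equals $a$ or $-b$ exactly, so the approximation agrees with the exact value; for step distributions that can overshoot, only the bounds are guaranteed.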


Our next example involves Doob's backward martingale. Before
defining this martingale, we need the following definition.


Definition 3.20 The random variables $X_1, \ldots, X_n$ are said to be
exchangeable if $X_{i_1}, \ldots, X_{i_n}$ has the same probability distribution for
every permutation $i_1, \ldots, i_n$ of $1, \ldots, n$.


Suppose that $X_1, \ldots, X_n$ are exchangeable. Assume $E[|X_1|] < \infty$,
let
$$S_j = \sum_{i=1}^j X_i, \quad j = 1, \ldots, n$$

and consider the Doob martingale $Z_1, \ldots, Z_n$ given by
$$Z_j = E[X_1|S_n, S_{n-1}, \ldots, S_{n+1-j}], \quad j = 1, \ldots, n$$

However,

$$\begin{aligned} S_{n+1-j} &= E[S_{n+1-j}|S_{n+1-j}, X_{n+2-j}, \ldots, X_n] \\ &= \sum_{i=1}^{n+1-j} E[X_i|S_{n+1-j}, X_{n+2-j}, \ldots, X_n] \\ &= (n + 1 - j) E[X_1|S_{n+1-j}, X_{n+2-j}, \ldots, X_n] \quad \text{(by exchangeability)} \\ &= (n + 1 - j) Z_j \end{aligned}$$

where the final equality follows since knowing $S_n, S_{n-1}, \ldots, S_{n+1-j}$
is equivalent to knowing $S_{n+1-j}, X_{n+2-j}, \ldots, X_n$.
The martingale

$$Z_j = \frac{S_{n+1-j}}{n + 1 - j}, \quad j = 1, \ldots, n$$

is called the Doob backward martingale. We now apply it to solve the
ballot problem.

Example 3.21 In an election between candidates A and B, candidate
A receives $n$ votes and candidate B receives $m$ votes, where $n > m$.
Assuming, in the count of the votes, that all orderings of the $n + m$
votes are equally likely, what is the probability that A is always ahead
in the count of the votes?


Solution Let $X_i$ equal 1 if the $i$th vote counted is for A and
let it equal $-1$ if that vote is for B. Because all orderings of the
$n + m$ votes are assumed to be equally likely, it follows that the
random variables $X_1, \ldots, X_{n+m}$ are exchangeable, and $Z_1, \ldots, Z_{n+m}$
is a Doob backward martingale when

$$Z_j = \frac{S_{n+m+1-j}}{n + m + 1 - j}$$

where $S_k = \sum_{i=1}^k X_i$. Because $Z_1 = S_{n+m}/(n + m) = (n - m)/(n + m)$,
the mean of this martingale is $(n - m)/(n + m)$. Because $n > m$,
A will always be ahead in the count of the vote unless there is a tie
at some point, which will occur if one of the $S_j$ (or equivalently, one
of the $Z_j$) is equal to 0. Consequently, define the bounded stopping
time $N$ by

$$N = \min\{j : Z_j = 0 \text{ or } j = n + m\}$$

Because $Z_{n+m} = X_1$, it follows that $Z_N$ will equal 0 if the candidates
are ever tied, and will equal $X_1$ if A is always ahead. However, if A
is always ahead, then A must receive the first vote; therefore,

$$Z_N = \begin{cases} 1, & \text{if A is always ahead} \\ 0, & \text{otherwise} \end{cases}$$

By the martingale stopping theorem, $E[Z_N] = (n - m)/(n + m)$,
yielding the result

$$P(\text{A is always ahead}) = \frac{n - m}{n + m}$$
■
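
For small vote counts the ballot probability can be verified by brute force. The following sketch enumerates every placement of A's votes in the count and tallies the orderings in which A is strictly ahead throughout; the function name is hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def prob_always_ahead(n, m):
    """Exact P(A strictly ahead throughout) over all orderings of n A-votes, m B-votes."""
    total = n + m
    good = count = 0
    for a_positions in combinations(range(total), n):
        a_set = set(a_positions)
        lead, ahead = 0, True
        for i in range(total):
            lead += 1 if i in a_set else -1
            if lead <= 0:  # a tie (or B leading) means A is not always ahead
                ahead = False
                break
        count += 1
        good += ahead
    return Fraction(good, count)

print(prob_always_ahead(5, 3), Fraction(5 - 3, 5 + 3))
```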


3.5 The Hoeffding-Azuma Inequality 
Let $Z_n, n \ge 0$, be a martingale with respect to the filtration $\mathcal{F}_n$.
If the differences $Z_n - Z_{n-1}$ can be shown to lie in a bounded random
interval of the form $[-B_n, -B_n + d_n]$ where $B_n \in \mathcal{F}_{n-1}$ and $d_n$
is constant, then the Hoeffding-Azuma inequality often yields useful
bounds on the tail probabilities of $Z_n$. Before presenting the
inequality we will need a couple of lemmas.


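Lemma 3.22 If $E[X] = 0$, and $P(-\alpha \le X \le \beta) = 1$, then for any
convex function $f$

$$E[f(X)] \le \frac{\beta}{\alpha + \beta} f(-\alpha) + \frac{\alpha}{\alpha + \beta} f(\beta)$$

Proof Because $f$ is convex it follows that, in the region $-\alpha \le x \le \beta$,
it is never above the line segment connecting the points $(-\alpha, f(-\alpha))$
and $(\beta, f(\beta))$. Because the formula for this line segment is

$$y = \frac{\beta}{\alpha + \beta} f(-\alpha) + \frac{\alpha}{\alpha + \beta} f(\beta) + \frac{1}{\alpha + \beta}[f(\beta) - f(-\alpha)]\,x$$

we obtain from the condition $P(-\alpha \le X \le \beta) = 1$ that

$$f(X) \le \frac{\beta}{\alpha + \beta} f(-\alpha) + \frac{\alpha}{\alpha + \beta} f(\beta) + \frac{1}{\alpha + \beta}[f(\beta) - f(-\alpha)]\,X$$
Taking expectations, and using that $E[X] = 0$, yields the result. ■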
Lemma 3.23 For $0 \le p \le 1$

$$p e^{t(1 - p)} + (1 - p) e^{-tp} \le e^{t^2/8}$$

Proof Letting $p = (1 + \alpha)/2$ and $t = 2\beta$, we must show that
for $-1 \le \alpha \le 1$

$$(1 + \alpha) e^{\beta(1 - \alpha)} + (1 - \alpha) e^{-\beta(1 + \alpha)} \le 2 e^{\beta^2/2}$$
or, equivalently,
$$e^{\beta} + e^{-\beta} + \alpha(e^{\beta} - e^{-\beta}) \le 2 e^{\alpha\beta + \beta^2/2}$$
The preceding inequality is true when $\alpha = -1$ or $+1$ and when $|\beta|$ is
large (say when $|\beta| \ge 100$). Thus, if the Lemma were false, then the
function

$$f(\alpha, \beta) = e^{\beta} + e^{-\beta} + \alpha(e^{\beta} - e^{-\beta}) - 2 e^{\alpha\beta + \beta^2/2}$$

would assume a strictly positive maximum in the interior of the
region $R = \{(\alpha, \beta) : |\alpha| \le 1, |\beta| \le 100\}$. Setting the partial derivatives
of $f$ equal to 0, we obtain

$$e^{\beta} - e^{-\beta} + \alpha(e^{\beta} + e^{-\beta}) = 2(\alpha + \beta) e^{\alpha\beta + \beta^2/2} \qquad (3.3)$$
$$e^{\beta} - e^{-\beta} = 2\beta e^{\alpha\beta + \beta^2/2} \qquad (3.4)$$

We will now prove the lemma by showing that any solution of (3.3)
and (3.4) must have $\beta = 0$. However, because $f(\alpha, 0) = 0$, this would
contradict the hypothesis that $f$ assumes a strictly positive maximum
in $R$, thus establishing the lemma.

So, assume that there is a solution of (3.3) and (3.4) in which
$\beta \ne 0$. Now note that there is no solution of these equations for
which $\alpha = 0$ and $\beta \ne 0$. For if there were such a solution, then (3.4)
would say that

$$e^{\beta} - e^{-\beta} = 2\beta e^{\beta^2/2} \qquad (3.5)$$
But expanding in a power series about 0 shows that (3.5) is equivalent
to
$$2 \sum_{i=0}^\infty \frac{\beta^{2i+1}}{(2i + 1)!} = 2 \sum_{i=0}^\infty \frac{\beta^{2i+1}}{i!\,2^i}$$
which (because $(2i + 1)! > i!\,2^i$ when $i > 0$) is clearly impossible
when $\beta \ne 0$. Thus, any solution of (3.3) and (3.4) in which $\beta \ne 0$
will also have $\alpha \ne 0$. Assuming such a solution gives, upon dividing
(3.3) by (3.4), that

$$1 + \alpha \frac{e^{\beta} + e^{-\beta}}{e^{\beta} - e^{-\beta}} = 1 + \frac{\alpha}{\beta}$$

Because $\alpha \ne 0$, the preceding is equivalent to
$$\beta(e^{\beta} + e^{-\beta}) = e^{\beta} - e^{-\beta}$$
or, expanding in a Taylor series,
$$\sum_{i=0}^\infty \frac{\beta^{2i+1}}{(2i)!} = \sum_{i=0}^\infty \frac{\beta^{2i+1}}{(2i + 1)!}$$
which is clearly not possible when $\beta \ne 0$. Thus, there is no solution
of (3.3) and (3.4) in which $\beta \ne 0$, thus proving the result. ■
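
A numerical sanity check of Lemma 3.23 is easy to run and can help catch transcription errors in the inequality. The sketch below evaluates the gap between the two sides on a grid; the grid sizes and function names are arbitrary choices for illustration, and a grid check is of course not a proof.

```python
import math

def lemma_gap(p, t):
    """Right side minus left side of Lemma 3.23; nonnegative if the lemma holds."""
    lhs = p * math.exp(t * (1 - p)) + (1 - p) * math.exp(-t * p)
    return math.exp(t * t / 8) - lhs

def smallest_gap(p_steps=50, t_max=10.0, t_steps=400):
    """Minimum of the gap over a grid of p in [0, 1] and t in [-t_max, t_max]."""
    worst = float("inf")
    for i in range(p_steps + 1):
        p = i / p_steps
        for j in range(-t_steps, t_steps + 1):
            t = t_max * j / t_steps
            worst = min(worst, lemma_gap(p, t))
    return worst

print(smallest_gap())
```

The gap is exactly 0 at $t = 0$, so the minimum over the grid is expected to be 0 up to floating point rounding.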


We are now ready for the Hoeffding-Azuma inequality.

Theorem 3.24 The Hoeffding-Azuma Inequality Let $Z_n$, $n \ge 1$,
be a martingale with mean $\mu$ with respect to the filtration $\mathcal{F}_n$.
Let $Z_0 = \mu$ and suppose there exist nonnegative random variables
$B_n, n > 0$, where $B_n \in \mathcal{F}_{n-1}$, and positive constants $d_n, n > 0$, such
that

$$-B_n \le Z_n - Z_{n-1} \le -B_n + d_n$$
Then, for $n > 0$, $a > 0$,

$$\text{(i)} \quad P(Z_n - \mu \ge a) \le e^{-2a^2/\sum_{i=1}^n d_i^2}$$
$$\text{(ii)} \quad P(Z_n - \mu \le -a) \le e^{-2a^2/\sum_{i=1}^n d_i^2} \qquad (3.6)$$

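Proof Suppose that $\mu = 0$. For any $c > 0$

$$\begin{aligned} P(Z_n \ge a) &= P(e^{cZ_n} \ge e^{ca}) \\ &\le e^{-ca} E[e^{cZ_n}] \end{aligned}$$

where we use Markov's inequality in the final step. Let $W_n = e^{cZ_n}$.
Note that $W_0 = 1$, and that for $n > 0$

$$\begin{aligned} E[W_n|\mathcal{F}_{n-1}] &= E[e^{cZ_{n-1}} e^{c(Z_n - Z_{n-1})}|\mathcal{F}_{n-1}] \\ &= e^{cZ_{n-1}} E[e^{c(Z_n - Z_{n-1})}|\mathcal{F}_{n-1}] \\ &= W_{n-1} E[e^{c(Z_n - Z_{n-1})}|\mathcal{F}_{n-1}] \qquad (3.7) \end{aligned}$$

where the second equality used that $Z_{n-1} \in \mathcal{F}_{n-1}$. Because
(i) $f(x) = e^{cx}$ is convex;
(ii)
$$\begin{aligned} E[Z_n - Z_{n-1}|\mathcal{F}_{n-1}] &= E[Z_n|\mathcal{F}_{n-1}] - E[Z_{n-1}|\mathcal{F}_{n-1}] \\ &= Z_{n-1} - Z_{n-1} = 0 \end{aligned}$$
and
(iii) $-B_n \le Z_n - Z_{n-1} \le -B_n + d_n$
it follows from Lemma 3.22, with $\alpha = B_n$, $\beta = -B_n + d_n$, that

$$\begin{aligned} E[e^{c(Z_n - Z_{n-1})}|\mathcal{F}_{n-1}] &\le E\Big[\frac{(-B_n + d_n) e^{-cB_n} + B_n e^{c(-B_n + d_n)}}{d_n} \Big| \mathcal{F}_{n-1}\Big] \\ &= \frac{(-B_n + d_n) e^{-cB_n} + B_n e^{c(-B_n + d_n)}}{d_n} \end{aligned}$$

where the final equality used that $B_n \in \mathcal{F}_{n-1}$. Hence, from Equation
(3.7), we see that

$$\begin{aligned} E[W_n|\mathcal{F}_{n-1}] &\le W_{n-1} \frac{(-B_n + d_n) e^{-cB_n} + B_n e^{c(-B_n + d_n)}}{d_n} \\ &\le W_{n-1} e^{c^2 d_n^2/8} \end{aligned}$$

where the final inequality used Lemma 3.23 (with $p = B_n/d_n$, $t = c d_n$).
Taking expectations gives

$$E[W_n] \le E[W_{n-1}] e^{c^2 d_n^2/8}$$

Using that $E[W_0] = 1$ yields, upon iterating this inequality, that
$$E[W_n] \le \prod_{i=1}^n e^{c^2 d_i^2/8} = e^{c^2 \sum_{i=1}^n d_i^2/8}$$
Therefore, from the Markov inequality bound at the start of the proof,
we obtain that for any $c > 0$

$$P(Z_n \ge a) \le \exp\Big(-ca + c^2 \sum_{i=1}^n d_i^2/8\Big)$$

Letting $c = 4a/\sum_{i=1}^n d_i^2$ (which is the value of $c$ that minimizes
$-ca + c^2 \sum_{i=1}^n d_i^2/8$) gives that
$$P(Z_n \ge a) \le e^{-2a^2/\sum_{i=1}^n d_i^2}$$

Parts (i) and (ii) of the Hoeffding-Azuma inequality now follow from
applying the preceding, first to the zero mean martingale $\{Z_n - \mu\}$,
and secondly to the zero mean martingale $\{\mu - Z_n\}$. ■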
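
The choice of $c$ in the last step of the proof can be spelled out as a one-variable minimization. The notation $g$ and $c^*$ below is introduced only for this worked step.

```latex
g(c) = -ca + \frac{c^2}{8}\sum_{i=1}^n d_i^2,
\qquad
g'(c) = -a + \frac{c}{4}\sum_{i=1}^n d_i^2 = 0
\;\Longrightarrow\;
c^* = \frac{4a}{\sum_{i=1}^n d_i^2},
\\[6pt]
g(c^*) = -\frac{4a^2}{\sum_{i=1}^n d_i^2} + \frac{16a^2}{8\sum_{i=1}^n d_i^2}
       = -\frac{2a^2}{\sum_{i=1}^n d_i^2}.
```

Since $g$ is a convex quadratic in $c$, the stationary point $c^*$ is its minimizer, and $c^* > 0$ whenever $a > 0$, as the proof requires.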


Example 3.25 Let $X_i$, $i \ge 1$, be independent Bernoulli random
variables with means $p_i$, $i = 1, \ldots, n$. Then

$$Z_n = \sum_{i=1}^n (X_i - p_i) = S_n - \sum_{i=1}^n p_i, \quad n \ge 0$$

is a martingale with mean 0. Because $Z_n - Z_{n-1} = X_n - p_n$, we see
that

$$-p_n \le Z_n - Z_{n-1} \le 1 - p_n$$

Thus, by the Hoeffding-Azuma inequality ($B_n = p_n$, $d_n = 1$), we see
that for $a > 0$

$$P\left(S_n - \sum_{i=1}^n p_i \ge a\right) \le e^{-2a^2/n}$$

$$P\left(S_n - \sum_{i=1}^n p_i \le -a\right) \le e^{-2a^2/n}$$

The preceding inequalities are often called Chernoff bounds. ∎
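The bound of Example 3.25 can be compared with an exact tail computation. The following is a minimal sketch, not part of the original text: it builds the exact distribution of $S_n$ by convolution and compares the upper tail with $e^{-2a^2/n}$. The list `ps` of means is a hypothetical choice made only for illustration.

```python
import math

def poisson_binomial_pmf(ps):
    """Exact pmf of S_n = X_1 + ... + X_n for independent Bernoulli(p_i)."""
    pmf = [1.0]
    for p in ps:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1 - p)      # X_i = 0
            nxt[k + 1] += mass * p        # X_i = 1
        pmf = nxt
    return pmf

def upper_tail(ps, a):
    """Exact P(S_n - sum(p_i) >= a)."""
    mean = sum(ps)
    pmf = poisson_binomial_pmf(ps)
    return sum(mass for k, mass in enumerate(pmf) if k - mean >= a)

def hoeffding_bound(n, a):
    """The bound e^{-2 a^2 / n} of Example 3.25."""
    return math.exp(-2 * a * a / n)

# Hypothetical means p_1, ..., p_30, spread between 0.1 and 0.9.
ps = [0.1 + 0.8 * i / 29 for i in range(30)]
for a in (1.0, 3.0, 5.0):
    print(a, upper_tail(ps, a), hoeffding_bound(len(ps), a))
```

In each printed row the exact tail should not exceed the bound; the gap gives a rough sense of how conservative the inequality is for this particular choice of means.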


The Hoeffding-Azuma inequality is often applied to a Doob type 
martingale. The following corollary is often used. 


Corollary 3.26 Let $h$ be such that if the vectors $x = (x_1, \ldots, x_n)$ and
$y = (y_1, \ldots, y_n)$ differ in at most one coordinate (that is, for some
$k$, $x_i = y_i$ for all $i \ne k$) then

$$|h(x) - h(y)| \le 1$$

Then, for a vector of independent random variables $X = (X_1, \ldots, X_n)$,
and $a > 0$

$$P(h(X) - E[h(X)] \ge a) \le e^{-2a^2/n}$$

$$P(h(X) - E[h(X)] \le -a) \le e^{-2a^2/n}$$


Proof Let $Z_0 = E[h(X)]$, and $Z_i = E[h(X)|\sigma(X_1, \ldots, X_i)]$, for
$i = 1, \ldots, n$. Then $Z_0, \ldots, Z_n$ is a martingale with respect to the
filtration $\sigma(X_1, \ldots, X_i)$, $i = 1, \ldots, n$. Now,

$$\begin{aligned}
Z_i - Z_{i-1} &= E[h(X)|X_1, \ldots, X_i] - E[h(X)|X_1, \ldots, X_{i-1}] \\
&\le \sup_x \{E[h(X)|X_1, \ldots, X_{i-1}, X_i = x] - E[h(X)|X_1, \ldots, X_{i-1}]\}
\end{aligned}$$

Similarly,

$$Z_i - Z_{i-1} \ge \inf_y \{E[h(X)|X_1, \ldots, X_{i-1}, X_i = y] - E[h(X)|X_1, \ldots, X_{i-1}]\}$$

Hence, letting

$$-B_i = \inf_y \{E[h(X)|X_1, \ldots, X_{i-1}, X_i = y] - E[h(X)|X_1, \ldots, X_{i-1}]\}$$

and $d_i = 1$, the result will follow from the Hoeffding-Azuma inequality
if we can show that

$$\sup_x \{E[h(X)|X_1, \ldots, X_{i-1}, X_i = x]\} - \inf_y \{E[h(X)|X_1, \ldots, X_{i-1}, X_i = y]\} \le 1$$

However, with $\mathbf{X}_{i-1} = (X_1, \ldots, X_{i-1})$, the left-hand side of the
preceding can be written as

$$\begin{aligned}
&\sup_{x,y} \{E[h(X)|X_1, \ldots, X_{i-1}, X_i = x] - E[h(X)|X_1, \ldots, X_{i-1}, X_i = y]\} \\
&\quad = \sup_{x,y} \{E[h(X_1, \ldots, X_{i-1}, x, X_{i+1}, \ldots, X_n)|\mathbf{X}_{i-1}] - E[h(X_1, \ldots, X_{i-1}, y, X_{i+1}, \ldots, X_n)|\mathbf{X}_{i-1}]\} \\
&\quad = \sup_{x,y} \{E[h(X_1, \ldots, X_{i-1}, x, X_{i+1}, \ldots, X_n) - h(X_1, \ldots, X_{i-1}, y, X_{i+1}, \ldots, X_n)|\mathbf{X}_{i-1}]\} \\
&\quad \le 1
\end{aligned}$$

and the proof is complete. ∎


Example 3.27 Let $X_1, X_2, \ldots, X_n$ be independent and identically
distributed discrete random variables, with $P(X_i = j) = p_j$. With
$N$ equal to the number of times the pattern 3, 4, 5, 6 appears in the
sequence $X_1, X_2, \ldots, X_n$, obtain bounds on the tail probability of
$N$.

Solution First note that

$$E[N] = \sum_{i=1}^{n-3} E[I_{\{\text{pattern begins at position } i\}}] = (n - 3) p_3 p_4 p_5 p_6$$

With $h(x_1, \ldots, x_n)$ equal to the number of times the pattern 3, 4, 5, 6
appears when $X_i = x_i$, $i = 1, \ldots, n$, it is easy to see that $h$ satisfies
the condition of Corollary 3.26. Hence, for $a > 0$

$$P(N - (n - 3) p_3 p_4 p_5 p_6 \ge a) \le e^{-2a^2/(n-3)}$$

$$P(N - (n - 3) p_3 p_4 p_5 p_6 \le -a) \le e^{-2a^2/(n-3)}$$ ∎
Example 3.28 Suppose that $n$ balls are to be placed in $m$ urns, with
each ball independently going into urn $i$ with probability $p_i$, $\sum_{i=1}^m p_i = 1$.
Find bounds on the tail probability of $Y_k$, equal to the number of
urns that receive exactly $k$ balls.

Solution First note that

$$E[Y_k] = E\left[\sum_{i=1}^m I_{\{\text{urn } i \text{ has exactly } k \text{ balls}\}}\right] = \sum_{i=1}^m \binom{n}{k} p_i^k (1 - p_i)^{n-k}$$

Let $X_i$ denote the urn in which ball $i$ is put, $i = 1, \ldots, n$. Also,
let $h_k(x_1, \ldots, x_n)$ denote the number of urns that receive exactly $k$
balls when $X_i = x_i$, $i = 1, \ldots, n$, and note that $Y_k = h_k(X_1, \ldots, X_n)$.
When $k = 0$ it is easy to see that $h_0$ satisfies the condition that if
$x$ and $y$ differ in at most one coordinate, then $|h_0(x) - h_0(y)| \le 1$.
Therefore, from Corollary 3.26 we obtain, for $a > 0$, that

$$P\left(Y_0 - \sum_{i=1}^m (1 - p_i)^n \ge a\right) \le e^{-2a^2/n}$$

$$P\left(Y_0 - \sum_{i=1}^m (1 - p_i)^n \le -a\right) \le e^{-2a^2/n}$$


For $0 < k < n$ it is no longer true that if $x$ and $y$ differ in at most
one coordinate, then $|h_k(x) - h_k(y)| \le 1$. This is so because the one
different value could result in one of the vectors having one less and
the other having one more urn with $k$ balls than would have resulted
if that coordinate was not included. Thus, if $x$ and $y$ differ in at
most one coordinate, then

$$|h_k(x) - h_k(y)| \le 2$$

showing that $h_k^*(x) = h_k(x)/2$ satisfies the condition of Corollary
3.26. Because

$$P(Y_k - E[Y_k] \ge a) = P(h_k^*(X) - E[h_k^*(X)] \ge a/2)$$

we obtain, for $a > 0$, $0 < k < n$, that

$$P\left(Y_k - \sum_{i=1}^m \binom{n}{k} p_i^k (1 - p_i)^{n-k} \ge a\right) \le e^{-a^2/(2n)}$$

$$P\left(Y_k - \sum_{i=1}^m \binom{n}{k} p_i^k (1 - p_i)^{n-k} \le -a\right) \le e^{-a^2/(2n)}$$

Of course,

$$P(Y_n = 1) = \sum_{i=1}^m p_i^n = 1 - P(Y_n = 0)$$ ∎
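The bounded-difference claims for $h_0$ and $h_k$ in Example 3.28 can be checked by brute force on a small instance. The sketch below is an illustration only; the sizes (4 balls, 3 urns) and the function names `h` and `max_change` are assumptions chosen so that exhaustive enumeration is cheap.

```python
from itertools import product

def h(k, x, m):
    """Number of the m urns that receive exactly k balls under assignment x."""
    counts = [0] * m
    for urn in x:
        counts[urn] += 1
    return sum(1 for c in counts if c == k)

def max_change(k, n, m):
    """Largest |h_k(x) - h_k(y)| over all x, y that differ in one coordinate."""
    worst = 0
    for x in product(range(m), repeat=n):
        base = h(k, x, m)
        for i in range(n):
            for v in range(m):
                if v != x[i]:
                    y = x[:i] + (v,) + x[i + 1:]   # move ball i to urn v
                    worst = max(worst, abs(base - h(k, y, m)))
    return worst

print(max_change(0, 4, 3), max_change(1, 4, 3))  # prints "1 2"
```

Moving one ball changes the contents of only two urns, which is why the change can reach 2 for $0 < k < n$ but not exceed it.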


3.6 Submartingales, Supermartingales, and a Convergence Theorem


Submartingales model superfair games, while supermartingales model 
subfair ones. 


Definition 3.29 The sequence of random variables $Z_n$, $n \ge 1$, is said
to be a submartingale for the filtration $\mathcal{F}_n$ if

1. $E[|Z_n|] < \infty$
2. $Z_n$ is adapted to $\mathcal{F}_n$
3. $E[Z_{n+1}|\mathcal{F}_n] \ge Z_n$

If 3 is replaced by $E[Z_{n+1}|\mathcal{F}_n] \le Z_n$, then $Z_n$, $n \ge 1$, is said to be a
supermartingale.

It follows from the tower property that

$$E[Z_{n+1}] \ge E[Z_n]$$

for a submartingale, with the inequality reversed for a supermartingale.
(Of course, if $Z_n$, $n \ge 1$, is a submartingale, then $-Z_n$, $n \ge 1$,
is a supermartingale, and vice versa.)
The analogs of the martingale stopping theorem remain valid
for submartingales and supermartingales. We leave the proof of the
following theorem as an exercise.

Theorem 3.30 If $N$ is a stopping time for the filtration $\mathcal{F}_n$, then

$$E[Z_N] \ge E[Z_1] \quad \text{for a submartingale}$$

$$E[Z_N] \le E[Z_1] \quad \text{for a supermartingale}$$

provided that any of the sufficient conditions of Theorem 3.14 hold.


One of the most useful results about submartingales is the Kol- 


mogorov inequality. Before presenting it, we need a couple of lem- 
mas. 


Lemma 3.31 If $Z_n$, $n \ge 1$, is a submartingale for the filtration $\mathcal{F}_n$,
and $N$ is a stopping time for this filtration such that $P(N \le n) = 1$,
then

$$E[Z_1] \le E[Z_N] \le E[Z_n]$$

Proof. Because $N$ is bounded, it follows from the submartingale
stopping theorem that $E[Z_N] \ge E[Z_1]$. Now,

$$E[Z_n|\mathcal{F}_k, N = k] = E[Z_n|\mathcal{F}_k] \ge Z_k = Z_N$$

Taking expectations of this inequality completes the proof. ∎


Lemma 3.32 If $Z_n$, $n \ge 1$, is a martingale with respect to the filtration
$\mathcal{F}_n$, $n \ge 1$, and $f$ a convex function for which $E[|f(Z_n)|] < \infty$,
then $f(Z_n)$, $n \ge 1$, is a submartingale with respect to the filtration
$\mathcal{F}_n$, $n \ge 1$.

Proof.

$$E[f(Z_{n+1})|\mathcal{F}_n] \ge f(E[Z_{n+1}|\mathcal{F}_n]) = f(Z_n)$$

where the inequality follows from Jensen's inequality. ∎


Theorem 3.33 Kolmogorov's Inequality for Submartingales Suppose
$Z_n$, $n \ge 1$, is a nonnegative submartingale. Then for $a > 0$

$$P(\max\{Z_1, \ldots, Z_n\} \ge a) \le E[Z_n]/a$$

Proof Let $N$ be the smallest $i$, $i \le n$, such that $Z_i \ge a$, and let it
equal $n$ if $Z_i < a$ for all $i = 1, \ldots, n$. Then

$$\begin{aligned}
P(\max\{Z_1, \ldots, Z_n\} \ge a) &= P(Z_N \ge a) \\
&\le E[Z_N]/a \quad \text{by Markov's inequality} \\
&\le E[Z_n]/a \quad \text{since } N \le n
\end{aligned}$$ ∎


Corollary 3.34 If $Z_n$, $n \ge 1$, is a martingale, then for $a > 0$

$$P(\max\{|Z_1|, \ldots, |Z_n|\} \ge a) \le \min(E[|Z_n|]/a, \; E[Z_n^2]/a^2)$$

Proof Noting that

$$P(\max\{|Z_1|, \ldots, |Z_n|\} \ge a) = P(\max\{Z_1^2, \ldots, Z_n^2\} \ge a^2)$$

the corollary follows from Lemma 3.32 and Kolmogorov's inequality
for submartingales upon using that $f(x) = |x|$ and $f(x) = x^2$ are
convex functions. ∎
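As an illustration of Corollary 3.34, consider the simple symmetric random walk $S_k$ (steps $\pm 1$ with probability 1/2 each), which is a martingale with $E[S_n^2] = n$. The sketch below, which is an added example and not from the original text, computes $P(\max_{k \le n} |S_k| \ge a)$ exactly by dynamic programming and compares it with the bound of the corollary. The function names are hypothetical.

```python
import math

def prob_max_abs_at_least(n, a):
    """Exact P(max_{1<=k<=n} |S_k| >= a) for the simple symmetric walk, integer a >= 1."""
    # alive[s] = probability the walk is at s and has stayed strictly inside (-a, a)
    alive = {0: 1.0}
    for _ in range(n):
        nxt = {}
        for s, pr in alive.items():
            for step in (-1, 1):
                t = s + step
                if abs(t) < a:
                    nxt[t] = nxt.get(t, 0.0) + 0.5 * pr
        alive = nxt
    return 1.0 - sum(alive.values())

def expected_abs(n):
    """Exact E|S_n|, using S_n = 2*Binomial(n, 1/2) - n."""
    return sum(math.comb(n, k) * abs(2 * k - n) for k in range(n + 1)) / 2**n

def kolmogorov_bound(n, a):
    """min(E|S_n|/a, E[S_n^2]/a^2) with E[S_n^2] = n."""
    return min(expected_abs(n) / a, n / a**2)

n = 50
for a in (5, 10, 15, 20):
    print(a, prob_max_abs_at_least(n, a), kolmogorov_bound(n, a))
```

For small $a$ the first-moment term $E[|S_n|]/a$ is the smaller of the two, while for large $a$ the second-moment term $n/a^2$ takes over.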


Theorem 3.35 The Martingale Convergence Theorem Let $Z_n$, $n \ge 1$,
be a martingale. If there is $M < \infty$ such that

$$E[|Z_n|] \le M \quad \text{for all } n$$

then, with probability 1, $\lim_{n \to \infty} Z_n$ exists and is finite.

Proof We will give a proof under the stronger condition that $E[Z_n^2]$
is bounded. Because $f(x) = x^2$ is convex, it follows from Lemma
3.32 that $Z_n^2$, $n \ge 1$, is a submartingale, yielding that $E[Z_n^2]$ is
nondecreasing. Because $E[Z_n^2]$ is bounded, it follows that it converges;
let $m < \infty$ be given by

$$m = \lim_{n \to \infty} E[Z_n^2]$$

We now argue that $\lim_{n \to \infty} Z_n$ exists and is finite by showing that,
with probability 1, $Z_n$, $n \ge 1$, is a Cauchy sequence. That is, we will
show that, with probability 1,

$$|Z_{m+k} - Z_m| \to 0 \quad \text{as } m, k \to \infty$$
Using that $Z_{m+k} - Z_m$, $k \ge 1$, is a martingale, it follows that
$(Z_{m+k} - Z_m)^2$, $k \ge 1$, is a submartingale. Thus, by Kolmogorov's inequality,

$$\begin{aligned}
P(|Z_{m+k} - Z_m| > \epsilon \text{ for some } k \le n) &= P\left(\max_{1 \le k \le n} (Z_{m+k} - Z_m)^2 > \epsilon^2\right) \\
&\le E[(Z_{m+n} - Z_m)^2]/\epsilon^2 \\
&= E[Z_{n+m}^2 - 2 Z_m Z_{n+m} + Z_m^2]/\epsilon^2
\end{aligned}$$

However,

$$\begin{aligned}
E[Z_m Z_{n+m}] &= E[E[Z_m Z_{n+m}|Z_m]] \\
&= E[Z_m E[Z_{n+m}|Z_m]] \\
&= E[Z_m^2]
\end{aligned}$$

Therefore,

$$P(|Z_{m+k} - Z_m| > \epsilon \text{ for some } k \le n) \le (E[Z_{n+m}^2] - E[Z_m^2])/\epsilon^2$$

Letting $n \to \infty$ now yields

$$P(|Z_{m+k} - Z_m| > \epsilon \text{ for some } k) \le (m - E[Z_m^2])/\epsilon^2$$

Thus,

$$P(|Z_{m+k} - Z_m| > \epsilon \text{ for some } k) \to 0 \quad \text{as } m \to \infty$$

Therefore, with probability 1, $Z_n$, $n \ge 1$, is a Cauchy sequence, and
so has a finite limit. ∎


As a consequence of the martingale convergence theorem we ob- 
tain the following. 


Corollary 3.36 If $Z_n$, $n \ge 1$, is a nonnegative martingale then, with
probability 1, $\lim_{n \to \infty} Z_n$ exists and is finite.

Proof Because $Z_n$ is nonnegative

$$E[|Z_n|] = E[Z_n] = E[Z_1] < \infty$$ ∎
Example 3.37 A branching process follows the size of a population
over succeeding generations. It supposes that, independent of what
occurred in prior generations, each individual in generation $n$
independently has $j$ offspring with probability $p_j$, $j \ge 0$. The offspring
of individuals of generation $n$ then make up generation $n + 1$. Let
$X_n$ denote the number of individuals in generation $n$. Assuming
that $m = \sum_j j p_j$, the mean number of offspring of an individual,
is finite, it is easy to verify that $Z_n = X_n/m^n$, $n \ge 0$, is a martingale.
Because it is nonnegative, the preceding corollary implies that
$\lim_n X_n/m^n$ exists and is finite. But this implies, when $m < 1$,
that $\lim_n X_n = 0$; or, equivalently, that $X_n = 0$ for all $n$ sufficiently
large. When $m > 1$, the implication is that the generation size either
becomes 0 or converges to infinity at an exponential rate. ∎
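One consequence of the martingale property in Example 3.37 is that $E[X_n/m^n]$ stays equal to $E[X_0]$. The following sketch checks this exactly for a small case by propagating the distribution of $X_n$ from one generation to the next. The offspring distribution `offspring` and the starting size $X_0 = 1$ are hypothetical choices made only for illustration.

```python
def convolve(u, v):
    """pmf of the sum of two independent nonnegative integer variables."""
    out = [0.0] * (len(u) + len(v) - 1)
    for i, a in enumerate(u):
        for j, b in enumerate(v):
            out[i + j] += a * b
    return out

def next_generation(dist, offspring):
    """Distribution of X_{n+1} given the distribution of X_n."""
    max_size = (len(dist) - 1) * (len(offspring) - 1)
    out = [0.0] * (max_size + 1)
    power = [1.0]  # pmf of the total offspring of 0 individuals
    for size, pr in enumerate(dist):
        if size > 0:
            power = convolve(power, offspring)  # total offspring of `size` individuals
        for total, q in enumerate(power):
            out[total] += pr * q
    return out

def mean(dist):
    return sum(k * pr for k, pr in enumerate(dist))

offspring = [0.2, 0.3, 0.5]   # hypothetical p_0, p_1, p_2
m = mean(offspring)           # mean number of offspring, here 1.3
dist = [0.0, 1.0]             # X_0 = 1
for n in range(1, 6):
    dist = next_generation(dist, offspring)
    print(n, mean(dist) / m**n, dist[0])
```

The second printed column should remain at 1 up to rounding, while the third column, $P(X_n = 0)$, is nondecreasing in $n$ because extinction is absorbing.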
3.7 Exercises

1. For $\mathcal{F} = \{\emptyset, \Omega\}$, show that $E[X|\mathcal{F}] = E[X]$.

2. Give the proof of Proposition 3.2 when $X$ and $Y$ are jointly
continuous.

3. If $E[|X_i|] < \infty$, $i = 1, \ldots, n$, show that

$$E\left[\sum_{i=1}^n X_i \,\Big|\, \mathcal{F}\right] = \sum_{i=1}^n E[X_i|\mathcal{F}]$$

4. Prove that if $f$ is a convex function, then

$$E[f(X)|\mathcal{F}] \ge f(E[X|\mathcal{F}])$$

provided the expectations exist.

5. Let $X_1, X_2, \ldots$ be independent random variables with mean
1. Show that $Z_n = \prod_{i=1}^n X_i$, $n \ge 1$, is a martingale.

6. If $E[X_{n+1}|X_1, \ldots, X_n] = a_n X_n + b_n$ for constants $a_n, b_n$, $n \ge 0$,
find constants $A_n, B_n$ so that $Z_n = A_n X_n + B_n$, $n \ge 0$, is a
martingale with respect to the filtration $\sigma(X_0, \ldots, X_n)$.

7. Consider a population of individuals as it evolves over time,
and suppose that, independent of what occurred in prior generations,
each individual in generation $n$ independently has $j$
offspring with probability $p_j$, $j \ge 0$. The offspring of individuals
of generation $n$ then make up generation $n + 1$. Assume that
$m = \sum_j j p_j < \infty$. Let $X_n$ denote the number of individuals
in generation $n$, and define a martingale related to $X_n$, $n \ge 0$.
The process $X_n$, $n \ge 0$, is called a branching process.

8. Suppose $X_1, X_2, \ldots$ are independent and identically distributed
random variables with mean zero and finite variance $\sigma^2$. If $T$
is a stopping time with finite mean, show that

$$\mathrm{Var}\left(\sum_{i=1}^T X_i\right) = \sigma^2 E(T).$$
9. Suppose $X_1, X_2, \ldots$ are independent and identically distributed
mean 0 random variables which each take value $+1$ with probability
1/2 and take value $-1$ with probability 1/2. Let $S_n = \sum_{i=1}^n X_i$.
Which of the following are stopping times? Compute
the mean for the ones that are stopping times.

(a) $T_1 = \min\{i \ge 5 : S_i = S_{i-5} + 5\}$

(b) $T_2 = T_1 - 5$

(c) $T_3 = T_2 + 10$.

10. Consider a sequence of independent flips of a coin, and let
$P_h$ denote the probability of a head on any toss. Let $A$ be
the hypothesis that $P_h = a$ and let $B$ be the hypothesis that
$P_h = b$, for given values $0 < a, b < 1$. Let $X_i$ be the outcome of
flip $i$, and set

$$Z_n = \frac{P(X_1, \ldots, X_n|A)}{P(X_1, \ldots, X_n|B)}$$

If $P_h = b$, show that $Z_n$, $n \ge 1$, is a martingale having mean 1.

11. Let $Z_n$, $n \ge 0$, be a martingale with $Z_0 = 0$. Show that

$$E[Z_n^2] = \sum_{i=1}^n E[(Z_i - Z_{i-1})^2]$$

12. Consider an individual who at each stage, independently of
past movements, moves to the right with probability $p$ or to
the left with probability $1 - p$. Assuming that $p > 1/2$ find
the expected number of stages it takes the person to move $i$
positions to the right from where she started.

13. In Example 3.19 obtain bounds on $p$ when $\theta < 0$.

14. Use Wald's equation to approximate the expected time it takes
a random walk to either become as large as $a$ or as small as $-b$,
for positive $a$ and $b$. Give the exact expression if $a$ and $b$ are
integers, and at each stage the random walk either moves up 1
with probability $p$ or moves down 1 with probability $1 - p$.

15. Consider a branching process that starts with a single individual.
Let $\pi$ denote the probability this process eventually dies
out. With $X_n$ denoting the number of individuals in generation
$n$, argue that $\pi^{X_n}$, $n \ge 0$, is a martingale.
16. Given $X_1, X_2, \ldots$, let $S_n = \sum_{i=1}^n X_i$ and $\mathcal{F}_n = \sigma(X_1, \ldots, X_n)$.
Suppose for all $n$ that $E|S_n| < \infty$ and $E[S_{n+1}|\mathcal{F}_n] = S_n$. Show
$E[X_i X_j] = 0$ if $i \ne j$.

17. Suppose $n$ random points are chosen in a circle having diameter
equal to 1, and let $X$ be the length of the shortest path
connecting all of them. For $a > 0$, bound $P(X - E[X] \ge a)$.

18. Let $X_1, X_2, \ldots, X_n$ be independent and identically distributed
discrete random variables, with $P(X_i = j) = p_j$. Obtain
bounds on the tail probability of the number of times the
pattern 0, 0, 0, 0 appears in the sequence.

19. Repeat Example 3.27, but now assuming that the $X_i$ are
independent but not identically distributed. Let $P_{i,j} = P(X_i = j)$.

20. Let $Z_n$, $n \ge 0$, be a martingale with mean $Z_0 = 0$, and let
$v_j$, $j \ge 0$, be a sequence of nondecreasing constants with $v_0 = 0$.
Prove the Kolmogorov-Hajek-Renyi inequality:

$$P(|Z_j| \le v_j, \text{ for all } j = 1, \ldots, n) \ge 1 - \sum_{j=1}^n E[(Z_j - Z_{j-1})^2]/v_j^2$$

21. Consider a gambler who plays at a fair casino. Suppose that the
casino does not give any credit, so that the gambler must quit
when his fortune is 0. Suppose further that on each bet made
at least 1 is either won or lost. Argue that, with probability 1,
a gambler who wants to play forever will eventually go broke.

22. What is the implication of the martingale convergence theorem
to the scenario of Exercise 10?

23. Three gamblers each start with $a$, $b$, and $c$ chips respectively.
In each round of a game a gambler is selected uniformly at
random to give up a chip, and one of the other gamblers is
selected uniformly at random to receive that chip. The game
ends when there are only two players remaining with chips.
Let $X_n$, $Y_n$, and $Z_n$ respectively denote the number of chips
the three players have after round $n$, so $(X_0, Y_0, Z_0) = (a, b, c)$.

(a) Compute $E[X_{n+1} Y_{n+1} Z_{n+1} \mid (X_n, Y_n, Z_n) = (x, y, z)]$.

(b) Show that $M_n = X_n Y_n Z_n + n(a + b + c)/3$ is a martingale.

(c) Use the preceding to compute the expected length of the
game.


Chapter 4 


Bounding Probabilities and 
Expectations 


4.1 Introduction 


In this chapter we develop some approaches for bounding expectations
and probabilities. We start in Section 2 with Jensen's inequality,
which bounds the expected value of a convex function of a
random variable. In Section 3 we develop the importance sampling
identity and show how it can be used to yield bounds on tail probabilities.
A specialization of this method results in the Chernoff bound,
which is developed in Section 4. Section 5 deals with the Second
Moment and the Conditional Expectation Inequalities, which bound
the probability that a random variable is positive. Section 6 develops
the Min-Max Identity and uses it to obtain bounds on the maximum
of a set of random variables. Finally, in Section 7 we introduce some
general stochastic order relations and explore their consequences.


4.2 Jensen’s Inequality 


Jensen’s inequality yields a lower bound on the expected value of a 
convex function of a random variable. 


Proposition 4.1 If $f$ is a convex function, then

$$E[f(X)] \ge f(E[X])$$

provided the expectations exist.

Proof. We give a proof under the assumption that $f$ has a Taylor
series expansion. Expanding $f$ about the value $\mu = E[X]$, and using
the Taylor series expansion with a remainder term, yields that for
some $a$

$$\begin{aligned}
f(x) &= f(\mu) + f'(\mu)(x - \mu) + f''(a)(x - \mu)^2/2 \\
&\ge f(\mu) + f'(\mu)(x - \mu)
\end{aligned}$$

where the preceding used that $f''(a) \ge 0$ by convexity. Hence,

$$f(X) \ge f(\mu) + f'(\mu)(X - \mu)$$

Taking expectations yields the result.


Remark 4.2 If

$$P(X = x_1) = \lambda = 1 - P(X = x_2)$$

then Jensen's inequality implies that for a convex function $f$

$$\lambda f(x_1) + (1 - \lambda) f(x_2) \ge f(\lambda x_1 + (1 - \lambda) x_2),$$

which is the definition of a convex function. Thus, Jensen’s inequal- 
ity can be thought of as extending the defining equation of convexity 
from random variables that take on only two possible values to arbi- 
trary random variables. 


4.3 Probability Bounds via the Importance Sampling Identity


Let $f$ and $g$ be probability density (or probability mass) functions;
let $h$ be an arbitrary function, and suppose that $g(x) = 0$ implies that
$f(x)h(x) = 0$. The following is known as the importance sampling
identity.


Proposition 4.3 The Importance Sampling Identity

$$E_f[h(X)] = E_g\left[\frac{h(X) f(X)}{g(X)}\right]$$

where the subscript on the expectation indicates the density (or mass
function) of the random variable $X$.

Proof We give the proof when $f$ and $g$ are density functions:

$$\begin{aligned}
E_f[h(X)] &= \int_{-\infty}^{\infty} h(x) f(x)\,dx \\
&= \int_{-\infty}^{\infty} \frac{h(x) f(x)}{g(x)}\, g(x)\,dx \\
&= E_g\left[\frac{h(X) f(X)}{g(X)}\right]
\end{aligned}$$ ∎

The importance sampling identity yields the following useful corol- 
lary concerning the tail probability of a random variable. 


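Corollary 4.4

$$P_f(X > c) = E_g\left[\frac{f(X)}{g(X)} \,\Big|\, X > c\right] P_g(X > c)$$

Proof.

$$\begin{aligned}
P_f(X > c) &= E_f[I_{\{X > c\}}] \\
&= E_g\left[\frac{I_{\{X > c\}} f(X)}{g(X)}\right] \\
&= E_g\left[\frac{I_{\{X > c\}} f(X)}{g(X)} \,\Big|\, X > c\right] P_g(X > c) \\
&= E_g\left[\frac{f(X)}{g(X)} \,\Big|\, X > c\right] P_g(X > c)
\end{aligned}$$ ∎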


Example 4.5 Bounding Standard Normal Tail Probabilities Let $f$ be
the standard normal density function

$$f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}, \quad -\infty < x < \infty$$

For $c > 0$, consider $P_f(X > c)$, the probability that a standard
normal random variable exceeds $c$. With

$$g(x) = c e^{-cx}, \quad x > 0$$

we obtain from Corollary 4.4

$$\begin{aligned}
P_f(X > c) &= \frac{e^{-c^2}}{c\sqrt{2\pi}} E_g[e^{-X^2/2} e^{cX} \mid X > c] \\
&= \frac{e^{-c^2}}{c\sqrt{2\pi}} E_g[e^{-(X+c)^2/2} e^{c(X+c)}]
\end{aligned}$$

where the first equality used that $P_g(X > c) = e^{-c^2}$, and the second
the lack of memory property of exponential random variables to
conclude that the conditional distribution of an exponential random
variable $X$ given that it exceeds $c$ is the unconditional distribution
of $X + c$. Thus the preceding yields

$$P_f(X > c) = \frac{e^{-c^2/2}}{c\sqrt{2\pi}} E_g[e^{-X^2/2}] \qquad (4.1)$$

Noting that, for $x > 0$,

$$1 - x < e^{-x} < 1 - x + x^2/2$$

we see that

$$1 - X^2/2 < e^{-X^2/2} < 1 - X^2/2 + X^4/8$$

Using that $E[X^2] = 2/c^2$ and $E[X^4] = 24/c^4$ when $X$ is exponential
with rate $c$, the preceding inequality yields

$$1 - 1/c^2 < E_g[e^{-X^2/2}] < 1 - 1/c^2 + 3/c^4$$

Consequently, using Equation (4.1), we obtain

$$(1 - 1/c^2) \frac{e^{-c^2/2}}{c\sqrt{2\pi}} < P_f(X > c) < (1 - 1/c^2 + 3/c^4) \frac{e^{-c^2/2}}{c\sqrt{2\pi}} \qquad (4.2)$$ ∎
Our next example uses the importance sampling identity to bound 
the probability that successive sums of a sequence of independent 
and identically distributed normal random variables with a negative 
mean ever cross some specified positive number. 
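Example 4.6 Let $X_1, X_2, \ldots$ be a sequence of independent and identically
distributed normal random variables with mean $\mu < 0$ and
variance 1. Let $S_k = \sum_{i=1}^k X_i$ and, for a fixed $A > 0$, consider

$$p = P(S_k > A \text{ for some } k)$$

Let $f_k(\mathbf{x}_k) = f_k(x_1, \ldots, x_k)$ be the joint density function of $\mathbf{X}_k = (X_1, \ldots, X_k)$. That is,

$$f_k(\mathbf{x}_k) = (2\pi)^{-k/2} e^{-\sum_{i=1}^k (x_i - \mu)^2/2}$$

Also, let $g_k$ be the joint density of $k$ independent and identically
distributed normal random variables with mean $-\mu$ and variance 1.
That is,

$$g_k(\mathbf{x}_k) = (2\pi)^{-k/2} e^{-\sum_{i=1}^k (x_i + \mu)^2/2}$$

Note that

$$\frac{f_k(\mathbf{x}_k)}{g_k(\mathbf{x}_k)} = e^{2\mu \sum_{i=1}^k x_i}$$

With

$$R_k = \left\{(x_1, \ldots, x_k) : \sum_{i=1}^j x_i \le A, \; j < k, \; \sum_{i=1}^k x_i > A\right\}$$

we have

$$\begin{aligned}
p &= \sum_{k=1}^{\infty} P(\mathbf{X}_k \in R_k) \\
&= \sum_{k=1}^{\infty} E_{f_k}[I_{\{\mathbf{X}_k \in R_k\}}] \\
&= \sum_{k=1}^{\infty} E_{g_k}\left[\frac{I_{\{\mathbf{X}_k \in R_k\}} f_k(\mathbf{X}_k)}{g_k(\mathbf{X}_k)}\right] \\
&= \sum_{k=1}^{\infty} E_{g_k}[I_{\{\mathbf{X}_k \in R_k\}} e^{2\mu S_k}]
\end{aligned}$$

Now, if $\mathbf{X}_k \in R_k$, then $S_k > A$, implying, because $\mu < 0$, that
$e^{2\mu S_k} \le e^{2\mu A}$. Because this implies that

$$I_{\{\mathbf{X}_k \in R_k\}} e^{2\mu S_k} \le I_{\{\mathbf{X}_k \in R_k\}} e^{2\mu A}$$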
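we obtain from the preceding that

$$\begin{aligned}
p &\le \sum_{k=1}^{\infty} E_{g_k}[I_{\{\mathbf{X}_k \in R_k\}} e^{2\mu A}] \\
&= e^{2\mu A} \sum_{k=1}^{\infty} E_{g_k}[I_{\{\mathbf{X}_k \in R_k\}}]
\end{aligned}$$

Now, if $Y_i$, $i \ge 1$, is a sequence of independent normal random variables
with mean $-\mu$ and variance 1, then

$$E_{g_k}[I_{\{\mathbf{X}_k \in R_k\}}] = P\left(\sum_{i=1}^j Y_i \le A, \; j < k, \; \sum_{i=1}^k Y_i > A\right)$$

Therefore, from the preceding

$$\begin{aligned}
p &\le e^{2\mu A} \sum_{k=1}^{\infty} P\left(\sum_{i=1}^j Y_i \le A, \; j < k, \; \sum_{i=1}^k Y_i > A\right) \\
&= e^{2\mu A} P\left(\sum_{i=1}^k Y_i > A \text{ for some } k\right) \\
&= e^{2\mu A}
\end{aligned}$$

where the final equality follows from the strong law of large numbers
because $\lim_{n \to \infty} \sum_{i=1}^n Y_i/n = -\mu > 0$ implies $P(\lim_{n \to \infty} \sum_{i=1}^n Y_i = \infty) = 1$,
and thus $P(\sum_{i=1}^k Y_i > A \text{ for some } k) = 1$.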

The bound

$$p = P(S_k > A \text{ for some } k) \le e^{2\mu A}$$

is not very useful when $A$ is a small nonnegative number. In this case,
we should condition on $X_1$, and then apply the preceding inequality.
With $\Phi$ being the standard normal distribution function, this yields

$$\begin{aligned}
p &= \int_{-\infty}^{A} P(S_k > A \text{ for some } k \mid X_1 = x) \frac{1}{\sqrt{2\pi}} e^{-(x-\mu)^2/2}\,dx + P(X_1 > A) \\
&\le \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{A} e^{2\mu(A - x)} e^{-(x-\mu)^2/2}\,dx + 1 - \Phi(A - \mu) \\
&= e^{2\mu A} \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{A} e^{-(x+\mu)^2/2}\,dx + 1 - \Phi(A - \mu) \\
&= e^{2\mu A} \Phi(A + \mu) + 1 - \Phi(A - \mu)
\end{aligned}$$

Thus, for instance

$$P(S_k > 0 \text{ for some } k) \le \Phi(\mu) + 1 - \Phi(-\mu) = 2\Phi(\mu)$$
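To see how much the conditioning step of Example 4.6 gains, the two bounds can be tabulated side by side. The sketch below is an added illustration: `crude_bound` is $e^{2\mu A}$, `refined_bound` is $e^{2\mu A}\Phi(A + \mu) + 1 - \Phi(A - \mu)$, and the value $\mu = -0.5$ is a hypothetical choice.

```python
import math

def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def crude_bound(mu, A):
    """The bound e^{2 mu A}, valid for mu < 0 and A > 0."""
    return math.exp(2 * mu * A)

def refined_bound(mu, A):
    """The bound obtained by first conditioning on X_1."""
    return math.exp(2 * mu * A) * Phi(A + mu) + 1 - Phi(A - mu)

mu = -0.5  # hypothetical negative drift
for A in (0.0, 0.5, 1.0, 3.0):
    print(A, crude_bound(mu, A), refined_bound(mu, A))
```

At $A = 0$ the crude bound equals 1 and carries no information, whereas the refined bound equals $2\Phi(\mu)$; for large $A$ the two are nearly the same.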


4.4 Chernoff Bounds 


Suppose that $X$ has probability density (or probability mass) function
$f(x)$. For $t > 0$, let

$$g(x) = \frac{e^{tx} f(x)}{M(t)}$$

where $M(t) = E_f[e^{tX}]$ is the moment generating function of $X$.
Corollary 4.4 yields, for $c > 0$, that

$$\begin{aligned}
P_f(X \ge c) &= E_g[M(t) e^{-tX} \mid X \ge c]\, P_g(X \ge c) \\
&\le E_g[M(t) e^{-tX} \mid X \ge c] \\
&\le M(t) e^{-tc}
\end{aligned}$$

Because the preceding holds for all $t > 0$, we can conclude that

$$P_f(X \ge c) \le \inf_{t > 0} M(t) e^{-tc} \qquad (4.3)$$

The inequality (4.3) is called the Chernoff bound.
Rather than choosing the value of $t$ so as to obtain the best
bound, it is often convenient to work with bounds that are more
analytically tractable. The following inequality can be used to simplify
the Chernoff bound for a sum of independent Bernoulli random
variables.

Lemma 4.7 For $0 \le p \le 1$

$$p e^{t(1-p)} + (1 - p) e^{-tp} \le e^{t^2/8}$$

The proof of Lemma 4.7 was given in Lemma 3.23 above.


Corollary 4.8 Let $X_1, \ldots, X_n$ be independent Bernoulli random
variables, and set $X = \sum_{i=1}^n X_i$. Then, for any $c > 0$

$$P(X - E[X] \ge c) \le e^{-2c^2/n} \qquad (4.4)$$

$$P(X - E[X] \le -c) \le e^{-2c^2/n} \qquad (4.5)$$

Proof. For $c > 0$, $t > 0$

$$\begin{aligned}
P(X - E[X] \ge c) &= P(e^{t(X - E[X])} \ge e^{tc}) \\
&\le e^{-tc} E[e^{t(X - E[X])}] \quad \text{by the Markov inequality} \\
&= e^{-tc} E\left[\exp\left\{\sum_{i=1}^n t(X_i - E[X_i])\right\}\right] \\
&= e^{-tc} E\left[\prod_{i=1}^n e^{t(X_i - E[X_i])}\right] \\
&= e^{-tc} \prod_{i=1}^n E[e^{t(X_i - E[X_i])}]
\end{aligned}$$

However, if $Y$ is Bernoulli with parameter $p$, then

$$E[e^{t(Y - E[Y])}] = p e^{t(1-p)} + (1 - p) e^{-tp} \le e^{t^2/8}$$

where the inequality follows from Lemma 4.7. Therefore,

$$P(X - E[X] \ge c) \le e^{-tc} e^{nt^2/8}$$

Letting $t = 4c/n$ yields the inequality (4.4).
The proof of the inequality (4.5) is obtained by writing it as

$$P(E[X] - X \ge c) \le e^{-2c^2/n}$$

and using an analogous argument. ∎
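The two ingredients of the proof of Corollary 4.8, Lemma 4.7 and the choice $t = 4c/n$, can be checked numerically. The following sketch is an added illustration with hypothetical function names; it evaluates the centered Bernoulli moment generating function on a grid and confirms that $t = 4c/n$ minimizes the exponent $-tc + nt^2/8$.

```python
import math

def bernoulli_centered_mgf(p, t):
    """E[exp(t (Y - p))] for Y Bernoulli(p): p e^{t(1-p)} + (1-p) e^{-tp}."""
    return p * math.exp(t * (1 - p)) + (1 - p) * math.exp(-t * p)

def chernoff_exponent(t, c, n):
    """Exponent -t c + n t^2 / 8 of the bound e^{-tc} e^{n t^2 / 8}."""
    return -t * c + n * t * t / 8.0

# Lemma 4.7 on a grid of (p, t) values.
worst_ratio = max(
    bernoulli_centered_mgf(i / 10, j / 10) / math.exp((j / 10) ** 2 / 8)
    for i in range(11)
    for j in range(-50, 51)
)
print(worst_ratio)  # should not exceed 1, up to rounding

# The minimizing t for illustrative values c = 3, n = 40.
c, n = 3.0, 40
t_star = 4 * c / n
print(chernoff_exponent(t_star, c, n), -2 * c * c / n)
```

The exponent is a convex quadratic in $t$, so setting its derivative $-c + nt/4$ to zero gives the minimizer directly.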
Example 4.9 Suppose that an entity contains $n + m$ cells, of which
cells numbered $1, \ldots, n$ are target cells, whereas cells $n + 1, \ldots, n + m$
are normal cells. Each of these $n + m$ cells has an associated
weight, with $w_i$ being the weight of cell $i$. Suppose that the cells
are destroyed one at a time in a random order such that if $S$ is the
current set of surviving cells, then the next cell destroyed is $i$, $i \in S$,
with probability $w_i/\sum_{j \in S} w_j$. In other words, the probability that a
specified surviving cell is the next one destroyed is equal to its weight
divided by the weights of all still surviving cells. Suppose that each
of the $n$ target cells has weight 1, whereas each of the $m$ normal
cells has weight $w$. For a specified value of $\alpha$, $0 < \alpha < 1$, let $N_\alpha$
equal the number of normal cells that are still alive at the moment
when the number of surviving target cells first falls below $\alpha n$. We
will now show that as $n, m \to \infty$, the probability mass function of
$N_\alpha$ becomes concentrated about the value $m\alpha^w$.

Theorem 4.10 For any $\epsilon > 0$, as $n \to \infty$ and $m \to \infty$,

$$P((1 - \epsilon) m \alpha^w \le N_\alpha \le (1 + \epsilon) m \alpha^w) \to 1$$


Proof. To prove the result, it is convenient to first formulate an
equivalent continuous time model that results in the times at which
the $n + m$ cells are killed being independent random variables. To do
so, let $X_1, \ldots, X_{n+m}$ be independent exponential random variables,
with $X_i$ having rate $w_i$, $i = 1, \ldots, n + m$. Note that $X_i$ will be the
smallest of these exponentials with probability $w_i/\sum_j w_j$; further,
given that $X_i$ is the smallest, $X_r$, $r \ne i$, will be the second smallest
with probability $w_r/\sum_{j \ne i} w_j$; further, given that $X_i$ and $X_r$ are,
respectively, the first and second smallest, $X_s$, $s \ne i, r$, will be the next
smallest with probability $w_s/\sum_{j \ne i, r} w_j$; and so on. Consequently,
if we imagine that cell $i$ is killed at time $X_i$, then the order in which
the $n + m$ cells are killed has the same distribution as the order in
which they are killed in the original model. So let us suppose that
cell $i$ is killed at time $X_i$, $i \ge 1$.

Now let $\tau_\alpha$ denote the time at which the number of surviving
target cells first falls below $n\alpha$. Also, let $N(t)$ denote the number of
normal cells that are still alive at time $t$; so $N_\alpha = N(\tau_\alpha)$. We will
first show that

$$P(N(\tau_\alpha) \le (1 + \epsilon) m \alpha^w) \to 1 \quad \text{as } n, m \to \infty \qquad (4.6)$$
To prove the preceding, let $\epsilon^*$ be such that $0 < \epsilon^* < \epsilon$, and set
$t = -\ln(\alpha(1 + \epsilon^*)^{1/w})$. We will prove (4.6) by showing that as $n$ and
$m$ approach $\infty$,

(i) $P(\tau_\alpha > t) \to 1$,

and

(ii) $P(N(t) \le (1 + \epsilon) m \alpha^w) \to 1$.

Because the events $\tau_\alpha > t$ and $N(t) \le (1 + \epsilon) m \alpha^w$ together imply
that $N(\tau_\alpha) \le N(t) \le (1 + \epsilon) m \alpha^w$, the result (4.6) will be established.

To prove (i), note that the number, call it $Y$, of surviving target
cells by time $t$ is a binomial random variable with parameters $n$ and
$e^{-t} = \alpha(1 + \epsilon^*)^{1/w}$. Hence, with $a = n\alpha[(1 + \epsilon^*)^{1/w} - 1]$, we have

$$P(\tau_\alpha \le t) = P(Y \le n\alpha) = P(Y \le n e^{-t} - a) \le e^{-2a^2/n}$$

where the inequality follows from the Chernoff bound (4.5). This
proves (i), because $a^2/n \to \infty$ as $n \to \infty$.

To prove (ii), note that $N(t)$ is a binomial random variable with
parameters $m$ and $e^{-wt} = \alpha^w(1 + \epsilon^*)$. Thus, by letting $b = m\alpha^w(\epsilon - \epsilon^*)$
and applying the Chernoff bound (4.4), we obtain

$$P(N(t) > (1 + \epsilon) m \alpha^w) = P(N(t) > m e^{-wt} + b) \le e^{-2b^2/m}$$

This proves (ii), because $b^2/m \to \infty$ as $m \to \infty$. Thus (4.6) is
established.

It remains to prove that

$$P(N(\tau_\alpha) \ge (1 - \epsilon) m \alpha^w) \to 1 \quad \text{as } n, m \to \infty \qquad (4.7)$$

However, (4.7) can be proven in a similar manner as was (4.6); a
combination of these two results completes the proof of the theorem. ∎
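The concentration asserted by Theorem 4.10 can be observed in a single simulated run of the continuous time model used in the proof. The sketch below is an added illustration, not the authors' code: target cells get independent exponential kill times with rate 1, normal cells with rate $w$, and $N_\alpha$ is read off at the first time the number of surviving targets drops below $\alpha n$. The parameter values and the seed are hypothetical.

```python
import random

def simulate_N_alpha(n, m, w, alpha, rng):
    """One draw of N_alpha using independent exponential kill times."""
    target_times = sorted(rng.expovariate(1.0) for _ in range(n))
    normal_times = [rng.expovariate(w) for _ in range(m)]
    # tau: first kill time at which the surviving target count drops below alpha * n
    kills = 0
    tau = None
    for time in target_times:
        kills += 1
        if n - kills < alpha * n:
            tau = time
            break
    # normal cells still alive at time tau
    return sum(1 for time in normal_times if time > tau)

rng = random.Random(0)     # fixed seed, chosen arbitrarily
n = m = 20000
w, alpha = 2.0, 0.5
ratio = simulate_N_alpha(n, m, w, alpha, rng) / (m * alpha**w)
print(ratio)               # expected to be close to 1 for large n and m
```

With these sizes the ratio typically lands within a few percent of 1; smaller $n$ and $m$ give visibly larger fluctuations, in line with the exponential bounds $e^{-2a^2/n}$ and $e^{-2b^2/m}$ in the proof.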


4.5 Second Moment and Conditional Expectation Inequalities


The second moment inequality gives a lower bound on the probability 
that a nonnegative random variable is positive. 
Proposition 4.11 The Second Moment Inequality For a nonnegative
random variable $X$

$$P(X > 0) \ge \frac{(E[X])^2}{E[X^2]}$$

Proof. Using Jensen's inequality in the second line below, we have

$$\begin{aligned}
E[X^2] &= E[X^2 \mid X > 0] P(X > 0) \\
&\ge (E[X \mid X > 0])^2 P(X > 0) \\
&= \frac{(E[X])^2}{P(X > 0)}
\end{aligned}$$ ∎

When $X$ is the sum of Bernoulli random variables, we can improve
the bound of the second moment inequality. So, suppose for the
remainder of this section that

$$X = \sum_{i=1}^n X_i$$

where $X_i$ is Bernoulli with $E[X_i] = p_i$, $i = 1, \ldots, n$.
We will need the following lemma.
We will need the following lemma. 


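Lemma 4.12 For any random variable $R$

$$E[XR] = \sum_{i=1}^n p_i E[R \mid X_i = 1]$$

Proof

$$\begin{aligned}
E[XR] &= E\left[\sum_{i=1}^n X_i R\right] \\
&= \sum_{i=1}^n E[X_i R] \\
&= \sum_{i=1}^n \{E[X_i R \mid X_i = 1] p_i + E[X_i R \mid X_i = 0](1 - p_i)\} \\
&= \sum_{i=1}^n E[R \mid X_i = 1] p_i
\end{aligned}$$ ∎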
Proposition 4.13 The Conditional Expectation Inequality

$$P(X > 0) \ge \sum_{i=1}^n \frac{p_i}{E[X \mid X_i = 1]}$$

Proof. Let $R = I_{\{X > 0\}}/X$. Noting that

$$E[XR] = P(X > 0), \qquad E[R \mid X_i = 1] = E\left[\frac{1}{X} \,\Big|\, X_i = 1\right]$$

we obtain, from Lemma 4.12,

$$\begin{aligned}
P(X > 0) &= \sum_{i=1}^n p_i E\left[\frac{1}{X} \,\Big|\, X_i = 1\right] \\
&\ge \sum_{i=1}^n \frac{p_i}{E[X \mid X_i = 1]}
\end{aligned}$$

where the final inequality made use of Jensen's inequality. ∎
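For independent Bernoulli summands all three quantities in this section have closed forms, which makes a direct comparison easy: $P(X > 0) = 1 - \prod_i (1 - p_i)$, $E[X^2] = \sum_i p_i(1 - p_i) + (\sum_i p_i)^2$, and $E[X \mid X_i = 1] = 1 + \sum_{j \ne i} p_j$. The sketch below is an added illustration under that independence assumption; the list `ps` is hypothetical.

```python
def true_prob(ps):
    """P(X > 0) for a sum of independent Bernoulli(p_i)."""
    prod = 1.0
    for p in ps:
        prod *= 1 - p
    return 1 - prod

def second_moment_bound(ps):
    """(E[X])^2 / E[X^2] under independence."""
    mean = sum(ps)
    second = sum(p * (1 - p) for p in ps) + mean**2
    return mean**2 / second

def conditional_expectation_bound(ps):
    """sum_i p_i / E[X | X_i = 1], with E[X | X_i = 1] = 1 + sum_{j != i} p_j."""
    total = sum(ps)
    return sum(p / (1 + total - p) for p in ps)

ps = [0.1, 0.2, 0.05, 0.3]  # hypothetical success probabilities
print(second_moment_bound(ps), conditional_expectation_bound(ps), true_prob(ps))
```

The printed values should be in increasing order: the conditional expectation bound is never weaker than the second moment bound, and both lie below the true probability.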


Example 4.14 Consider a system consisting of $m$ components, each
of which either works or not. Suppose, further, that for given subsets
of components $S_j$, $j = 1, \ldots, n$, none of which is a subset of another,
the system functions if all of the components of at least one of these
subsets work. If component $j$ independently works with probability
$\alpha_j$, derive a lower bound on the probability the system functions.

Solution Let $X_i$ equal 1 if all the components in $S_i$ work, and let
it equal 0 otherwise, $i = 1, \ldots, n$. Also, let

$$p_i = P(X_i = 1) = \prod_{j \in S_i} \alpha_j$$

Then, with $X = \sum_{i=1}^n X_i$, we have

$$\begin{aligned}
P(\text{system functions}) &= P(X > 0) \\
&\ge \sum_{i=1}^n \frac{P(X_i = 1)}{\sum_{j=1}^n P(X_j = 1 \mid X_i = 1)} \\
&= \sum_{i=1}^n \frac{\prod_{k \in S_i} \alpha_k}{1 + \sum_{j \ne i} \prod_{k \in S_j - S_i} \alpha_k}
\end{aligned}$$

where $S_j - S_i$ consists of all components that are in $S_j$ but not in
$S_i$. ∎


Example 4.15 Consider a random graph on the set of vertices $\{1, 2, \ldots, n\}$,
which is such that each of the $\binom{n}{2}$ pairs of vertices $i \ne j$
is, independently, an edge of the graph with probability $p$. We are
interested in the probability that this graph will be connected, where
by connected we mean that for each pair of distinct vertices $i \ne j$
there is a sequence of edges of the form $(i, i_1), (i_1, i_2), \ldots, (i_k, j)$.
(That is, a graph is connected if for each pair of distinct vertices
$i$ and $j$, there is a path from $i$ to $j$.)

Suppose that

$$p = c\,\frac{\ln(n)}{n}$$

We will now show that if $c < 1$, then the probability that the graph
is connected goes to 0 as $n \to \infty$. To verify this result, consider the
number of isolated vertices, where vertex $i$ is said to be isolated if
there are no edges of type $(i, j)$. Let $X_i$ be the indicator variable for
the event that vertex $i$ is isolated, and let

$$X = \sum_{i=1}^n X_i$$

be the number of isolated vertices.

Now,

$$P(X_i = 1) = (1 - p)^{n-1}$$

Also,

$$\begin{aligned}
E[X \mid X_i = 1] &= \sum_{j=1}^n P(X_j = 1 \mid X_i = 1) \\
&= 1 + \sum_{j \ne i} (1 - p)^{n-2} \\
&= 1 + (n - 1)(1 - p)^{n-2}
\end{aligned}$$

Because

$$(1 - p)^{n-1} = \left(1 - c\,\frac{\ln(n)}{n}\right)^{n-1} \approx e^{-c \ln(n)} = n^{-c}$$

the conditional expectation inequality yields that for $n$ large

$$P(X > 0) \ge \frac{n(1 - p)^{n-1}}{1 + (n - 1)(1 - p)^{n-2}} \approx \frac{n^{1-c}}{1 + (n - 1) n^{-c}}$$

Therefore,

$$c < 1 \Rightarrow P(X > 0) \to 1 \text{ as } n \to \infty$$

Because the graph is not connected if $X > 0$, it follows that the graph
will almost certainly be disconnected when $n$ is large and $c < 1$. (It
can be shown when $c > 1$ that the probability the graph is connected
goes to 1 as $n \to \infty$.) ∎
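The lower bound of Example 4.15 can be evaluated exactly, without the large-$n$ approximation, to see how quickly it approaches 1 when $c < 1$. The sketch below is an added illustration; the function name `isolated_vertex_bound` and the chosen values of $n$ and $c$ are hypothetical.

```python
import math

def isolated_vertex_bound(n, c):
    """Conditional expectation lower bound on P(some vertex is isolated), p = c ln(n)/n."""
    p = c * math.log(n) / n
    p_isolated = (1 - p) ** (n - 1)              # P(X_i = 1)
    cond_mean = 1 + (n - 1) * (1 - p) ** (n - 2)  # E[X | X_i = 1]
    return n * p_isolated / cond_mean

for n in (10**2, 10**4, 10**6):
    print(n, isolated_vertex_bound(n, 0.5), isolated_vertex_bound(n, 1.5))
```

For $c = 0.5$ the bound climbs toward 1 as $n$ grows, while for $c = 1.5$ it shrinks toward 0, so in that regime it says nothing about connectivity.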


4.6 The Min-Max Identity and Bounds on the 
Maximum 


In this section we will be interested in obtaining an upper bound on
$E[\max_i X_i]$, when $X_1, \ldots, X_n$ are nonnegative random variables. To
begin, note that for any nonnegative constant $c$,

$$\max_i X_i \le c + \sum_{i=1}^n (X_i - c)^+ \qquad (4.8)$$

where $x^+$, the positive part of $x$, is equal to $x$ if $x > 0$ and is equal to
0 otherwise. Taking expectations of the preceding inequality yields
that

$$E[\max_i X_i] \le c + \sum_{i=1}^n E[(X_i - c)^+]$$

Because $(X_i - c)^+$ is a nonnegative random variable, we have

$$\begin{aligned}
E[(X_i - c)^+] &= \int_0^{\infty} P((X_i - c)^+ > x)\,dx \\
&= \int_0^{\infty} P(X_i - c > x)\,dx \\
&= \int_c^{\infty} P(X_i > y)\,dy
\end{aligned}$$

Therefore,

$$E[\max_i X_i] \le c + \sum_{i=1}^n \int_c^{\infty} P(X_i > y)\,dy \qquad (4.9)$$

Because the preceding is true for all $c \ge 0$, the best bound is obtained
by choosing the $c$ that minimizes the right side of the preceding.
Differentiating, and setting the result equal to 0, shows that the best
bound is obtained when $c$ is the value $c^*$ such that

$$\sum_{i=1}^n P(X_i > c^*) = 1$$

Because $\sum_{i=1}^n P(X_i > c)$ is a decreasing function of $c$, the value of $c^*$
can be easily approximated and then utilized in the inequality (4.9).
It is interesting to note that $c^*$ is such that the expected number
of the $X_i$ that exceed $c^*$ is equal to 1, which is interesting because
the inequality (4.8) becomes an equality when exactly one of the $X_i$
exceeds $c$.


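Example 4.16 Suppose the $X_i$ are independent exponential random
variables with rates $\lambda_i$, $i = 1, \ldots, n$. Then the minimizing value $c^*$ is
such that

$$1 = \sum_{i=1}^{n} e^{-\lambda_i c^*}$$

with resulting bound

$$\begin{aligned}
E[\max_i X_i] &\leq c^* + \sum_{i=1}^{n} \int_{c^*}^\infty e^{-\lambda_i y}\,dy \\
&= c^* + \sum_{i=1}^{n} \frac{1}{\lambda_i} e^{-\lambda_i c^*}
\end{aligned}$$

In the special case where the rates are all equal, say $\lambda_i = 1$, then

$$1 = n e^{-c^*} \quad \text{or} \quad c^* = \ln(n)$$

and the bound becomes

$$E[\max_i X_i] \leq \ln(n) + 1 \tag{4.10}$$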


However, it is easy to compute the expected maximum of a sequence
of independent exponentials with rate 1. Interpreting these random
variables as the failure times of n components, we can write

$$\max_i X_i = \sum_{i=1}^{n} T_i$$

where $T_i$ is the time between the $(i-1)$st and the $i$th failure. Using
the lack of memory property of exponentials, it follows that the $T_i$
are independent, with $T_i$ being an exponential random variable with
rate $n - i + 1$. (This is because when the $(i-1)$st failure occurs, the
time until the next failure is the minimum of the $n - i + 1$ remaining
lifetimes, each of which is exponential with rate 1.) Therefore, in
this case

$$E[\max_i X_i] = \sum_{i=1}^{n} \frac{1}{n-i+1} = \sum_{i=1}^{n} \frac{1}{i}$$

As it is known that, for n large

$$\sum_{i=1}^{n} \frac{1}{i} \approx \ln(n) + E$$

where $E = .5772$ is Euler's constant, we see that the bound yielded
by the approach can be quite accurate. (Also, the bound (4.10) only
requires that the $X_i$ are exponential with rate 1, and not that they
are independent.) ∎
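
The recipe of Example 4.16 can be tried numerically. The following is a minimal sketch, not part of the original text: it assumes exponential marginals, finds $c^*$ by bisection (the function names `c_star`, `max_bound`, and `exact_mean_max_rate_one` are illustrative choices), and compares the bound with the exact harmonic-number value in the equal-rate case.

```python
import math

def c_star(rates, tol=1e-12):
    """Solve sum_i exp(-rate_i * c) = 1 for c >= 0 by bisection."""
    def excess(c):
        return sum(math.exp(-lam * c) for lam in rates) - 1.0
    if excess(0.0) <= 0.0:      # a single variable: the sum is already 1 at c = 0
        return 0.0
    lo, hi = 0.0, 1.0
    while excess(hi) > 0.0:     # grow the bracket until the sum drops below 1
        hi *= 2.0
    while hi - lo > tol:        # the sum is decreasing in c, so bisection applies
        mid = 0.5 * (lo + hi)
        if excess(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def max_bound(rates):
    """Right side of (4.9) at c = c*: c* + sum_i exp(-rate_i * c*) / rate_i."""
    c = c_star(rates)
    return c + sum(math.exp(-lam * c) / lam for lam in rates)

def exact_mean_max_rate_one(n):
    """E[max] of n independent rate-1 exponentials: 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / i for i in range(1, n + 1))

if __name__ == "__main__":
    n = 10
    print("bound:", max_bound([1.0] * n))
    print("exact (independent case):", exact_mean_max_rate_one(n))
```

With equal rates the bisection should recover $c^* = \ln(n)$, so the bound agrees with (4.10); with unequal rates only the marginal distributions enter, as in the text.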


The preceding bounds on E[max; X;] only involve the marginal 
distributions of the X;. When we have additional knowledge about 
the joint distributions, we can often do better. To illustrate this, we 
first need to establish an identity relating the maximum of a set of 
random variables to the minimums of all the partial sets. 

For nonnegative random variables $X_1, \ldots, X_n$, fix $x$ and let $A_i$
denote the event that $X_i > x$. Let

$$X = \max(X_1, \ldots, X_n)$$

Noting that $X$ will be greater than $x$ if and only if at least one of
the events $A_i$ occurs, we have

$$P(X > x) = P(\cup_{i=1}^{n} A_i)$$

and the inclusion-exclusion identity gives

$$P(X > x) = \sum_i P(A_i) - \sum_{i<j} P(A_i A_j) + \sum_{i<j<k} P(A_i A_j A_k) + \ldots + (-1)^{n+1} P(A_1 \cdots A_n)$$

which can be succinctly written

$$P(X > x) = \sum_{r=1}^{n} (-1)^{r+1} \sum_{i_1 < \cdots < i_r} P(A_{i_1} \cdots A_{i_r})$$

Now,
$$\begin{aligned}
P(A_i) &= P(X_i > x) \\
P(A_i A_j) &= P(X_i > x, X_j > x) = P(\min(X_i, X_j) > x) \\
P(A_i A_j A_k) &= P(X_i > x, X_j > x, X_k > x) = P(\min(X_i, X_j, X_k) > x)
\end{aligned}$$
and so on. Thus, we see that
$$P(X > x) = \sum_{r=1}^{n} (-1)^{r+1} \sum_{i_1 < \cdots < i_r} P(\min(X_{i_1}, \ldots, X_{i_r}) > x)$$
Integrating both sides as $x$ goes from 0 to $\infty$ gives the result:
$$E[X] = \sum_{r=1}^{n} (-1)^{r+1} \sum_{i_1 < \cdots < i_r} E[\min(X_{i_1}, \ldots, X_{i_r})]$$


Moreover, using that going out one term in the inclusion-exclusion
identity results in an upper bound on the probability of the union,
going out two terms yields a lower bound, going out three terms
yields an upper bound, and so on, yields

$$\begin{aligned}
E[X] &\leq \sum_i E[X_i] \\
E[X] &\geq \sum_i E[X_i] - \sum_{i<j} E[\min(X_i, X_j)] \\
E[X] &\leq \sum_i E[X_i] - \sum_{i<j} E[\min(X_i, X_j)] + \sum_{i<j<k} E[\min(X_i, X_j, X_k)] \\
E[X] &\geq \ldots
\end{aligned}$$

and so on.
Example 4.17 Consider the coupon collector's problem, where each
different coupon collected is, independent of past selections, a type
$i$ coupon with probability $p_i$. Suppose we are interested in $E[X]$,
where $X$ is the number of coupons we need collect to obtain at least
one of each type. Then, letting $X_i$ denote the number that need be
collected to obtain a type $i$ coupon we have that

$$X = \max(X_1, \ldots, X_n)$$

yielding that

$$E[X] = \sum_{r=1}^{n} (-1)^{r+1} \sum_{i_1 < \cdots < i_r} E[\min(X_{i_1}, \ldots, X_{i_r})]$$

Now, $\min(X_{i_1}, \ldots, X_{i_r})$ is the number of coupons that need be
collected to obtain any of the types $i_1, \ldots, i_r$. Because each new coupon
collected will be one of these types with probability $\sum_{j=1}^{r} p_{i_j}$, it
follows that $\min(X_{i_1}, \ldots, X_{i_r})$ is a geometric random variable with
mean $1 / \sum_{j=1}^{r} p_{i_j}$. Thus, we obtain the result

$$E[X] = \sum_i \frac{1}{p_i} - \sum_{i<j} \frac{1}{p_i + p_j} + \sum_{i<j<k} \frac{1}{p_i + p_j + p_k} + \ldots + (-1)^{n+1} \frac{1}{p_1 + \ldots + p_n}$$
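
As a small illustration of the formula of Example 4.17, the sketch below sums over all nonempty subsets of coupon types. It is an illustrative helper (the name `expected_coupons` is not from the text) and is only practical for small n, since the number of subsets grows like $2^n$.

```python
from itertools import combinations

def expected_coupons(p):
    """E[X] = sum over nonempty subsets S of (-1)^(|S|+1) / sum_{i in S} p_i."""
    n = len(p)
    total = 0.0
    for r in range(1, n + 1):
        sign = 1.0 if r % 2 == 1 else -1.0   # (-1)^(r+1)
        for subset in combinations(p, r):    # all choices i_1 < ... < i_r
            total += sign / sum(subset)      # mean of the geometric minimum
    return total

if __name__ == "__main__":
    # With equal probabilities the answer should match n * (1 + 1/2 + ... + 1/n).
    print(expected_coupons([0.25] * 4))
```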


Using the preceding formula for the mean number of coupons
needed to obtain a complete set requires summing over $2^n$ terms, and
so is not practical when n is large. Moreover, the bounds obtained
by only going out a few steps in the formula for the expected value
of a maximum generally turn out to be too loose to be beneficial.
However, a useful bound can often be obtained by applying the max-min
inequalities to an upper bound for $E[X]$ rather than directly to
$E[X]$. We now develop the theory.

For nonnegative random variables $X_1, \ldots, X_n$, let

$$X = \max(X_1, \ldots, X_n).$$
Fix $c \geq 0$, and note the inequality

$$X \leq c + \max((X_1 - c)^+, \ldots, (X_n - c)^+).$$
Now apply the max-min upper bound inequalities to the right side
of the preceding, take expectations, and obtain that

$$\begin{aligned}
E[X] &\leq c + \sum_i E[(X_i - c)^+] \\
E[X] &\leq c + \sum_i E[(X_i - c)^+] - \sum_{i<j} E[\min((X_i - c)^+, (X_j - c)^+)] \\
&\quad + \sum_{i<j<k} E[\min((X_i - c)^+, (X_j - c)^+, (X_k - c)^+)]
\end{aligned}$$
and so on.


Example 4.18 Consider the case of independent exponential random
variables $X_1, \ldots, X_n$, all having rate 1. Then, the preceding
gives the bound

$$\begin{aligned}
E[\max_i X_i] &\leq c + \sum_i E[(X_i - c)^+] - \sum_{i<j} E[\min((X_i - c)^+, (X_j - c)^+)] \\
&\quad + \sum_{i<j<k} E[\min((X_i - c)^+, (X_j - c)^+, (X_k - c)^+)]
\end{aligned}$$

To obtain the terms in the three sums of the right hand side of
the preceding, condition, respectively, on whether $X_i > c$, whether
$\min(X_i, X_j) > c$, and whether $\min(X_i, X_j, X_k) > c$. This yields

$$\begin{aligned}
E[(X_i - c)^+] &= e^{-c} \\
E[\min((X_i - c)^+, (X_j - c)^+)] &= \frac{e^{-2c}}{2} \\
E[\min((X_i - c)^+, (X_j - c)^+, (X_k - c)^+)] &= \frac{e^{-3c}}{3}
\end{aligned}$$

Using the constant $c = \ln(n)$ yields the bound

$$\begin{aligned}
E[\max_i X_i] &\leq \ln(n) + 1 - \frac{n(n-1)}{4n^2} + \frac{n(n-1)(n-2)}{18n^3} \\
&\approx \ln(n) + .806 \quad \text{for } n \text{ large}
\end{aligned}$$ ∎
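
To see how much the two extra terms help, the following sketch evaluates the bound of Example 4.18 at $c = \ln(n)$ and compares it with the marginal-only bound (4.10) and the exact value for independent rate-1 exponentials. The function names are illustrative and not from the text.

```python
import math

def harmonic(n):
    """Exact E[max] for n independent rate-1 exponentials."""
    return sum(1.0 / i for i in range(1, n + 1))

def marginal_bound(n):
    """Bound (4.10): ln(n) + 1, using only the marginal distributions."""
    return math.log(n) + 1.0

def third_order_bound(n):
    """Bound of Example 4.18 with c = ln(n), using pairs and triples."""
    c = math.log(n)
    singles = n * math.exp(-c)                                      # sum of e^{-c}
    pairs = (n * (n - 1) / 2) * math.exp(-2 * c) / 2                # sum of e^{-2c}/2
    triples = (n * (n - 1) * (n - 2) / 6) * math.exp(-3 * c) / 3    # sum of e^{-3c}/3
    return c + singles - pairs + triples

if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, harmonic(n), third_order_bound(n), marginal_bound(n))
```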
Example 4.19 Let us reconsider Example 4.17, the coupon collector's
problem. Let $c$ be an integer. To compute $E[(X_i - c)^+]$,
condition on whether a type $i$ coupon is among the first $c$ collected.

$$\begin{aligned}
E[(X_i - c)^+] &= E[(X_i - c)^+ \mid X_i \leq c] P(X_i \leq c) + E[(X_i - c)^+ \mid X_i > c] P(X_i > c) \\
&= E[(X_i - c)^+ \mid X_i > c] (1 - p_i)^c \\
&= \frac{(1 - p_i)^c}{p_i}
\end{aligned}$$

Similarly,

$$\begin{aligned}
E[\min((X_i - c)^+, (X_j - c)^+)] &= \frac{(1 - p_i - p_j)^c}{p_i + p_j} \\
E[\min((X_i - c)^+, (X_j - c)^+, (X_k - c)^+)] &= \frac{(1 - p_i - p_j - p_k)^c}{p_i + p_j + p_k}
\end{aligned}$$

Therefore, for any nonnegative integer $c$

$$\begin{aligned}
E[X] &\leq c + \sum_i \frac{(1 - p_i)^c}{p_i} \\
E[X] &\leq c + \sum_i \frac{(1 - p_i)^c}{p_i} - \sum_{i<j} \frac{(1 - p_i - p_j)^c}{p_i + p_j} + \sum_{i<j<k} \frac{(1 - p_i - p_j - p_k)^c}{p_i + p_j + p_k}
\end{aligned}$$ ∎
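
The first bound of Example 4.19 is cheap to evaluate for every integer $c$, so one can simply scan for the best one. The sketch below does that; the helper names and the search limit `c_max` are assumptions made for illustration.

```python
def coupon_bound(p, c):
    """c + sum_i (1 - p_i)^c / p_i, the first bound of Example 4.19."""
    return c + sum((1.0 - pi) ** c / pi for pi in p)

def best_coupon_bound(p, c_max=200):
    """Smallest bound over integers c = 0, ..., c_max, returned with its c."""
    return min((coupon_bound(p, c), c) for c in range(c_max + 1))

if __name__ == "__main__":
    bound, c = best_coupon_bound([0.25] * 4)
    # For four equally likely types the exact mean is 4 * (1 + 1/2 + 1/3 + 1/4).
    print("best c:", c, "bound:", bound, "exact:", 25.0 / 3.0)
```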


4.7 Stochastic Orderings 
We say that $X$ is stochastically greater than $Y$, written $X \geq_{st} Y$, if
$$P(X > t) \geq P(Y > t) \quad \text{for all } t$$

In this section, we define and compare some other stochastic orderings
of random variables.

If $X$ is a nonnegative continuous random variable with distribution
function $F$ and density $f$ then the hazard rate function of $X$ is
defined by

$$\lambda_X(t) = f(t)/\bar{F}(t)$$
where $\bar{F}(t) = 1 - F(t)$. Interpreting $X$ as the lifetime of an item,
then for $\epsilon$ small

$$\begin{aligned}
P(t \text{ year old item dies within an additional time } \epsilon) &= P(X < t + \epsilon \mid X > t) \\
&\approx \lambda_X(t)\epsilon
\end{aligned}$$
If $Y$ is a nonnegative continuous random variable with distribution
function $G$ and density $g$, say that $X$ is hazard rate order larger
than $Y$, written $X \geq_{hr} Y$, if

$$\lambda_X(t) \leq \lambda_Y(t) \quad \text{for all } t$$

Say that $X$ is likelihood ratio order larger than $Y$, written $X \geq_{lr} Y$, if

$$f(x)/g(x) \uparrow x$$

(that is, $f(x)/g(x)$ is nondecreasing in $x$), where $f$ and $g$ are the
respective densities of $X$ and $Y$.


Proposition 4.20
$$X \geq_{lr} Y \implies X \geq_{hr} Y \implies X \geq_{st} Y$$


Proof. Let $X$ have density $f$, and $Y$ have density $g$. Suppose $X \geq_{lr} Y$.
Then, for $y \geq x$

$$f(y) = g(y)\frac{f(y)}{g(y)} \geq g(y)\frac{f(x)}{g(x)}$$
implying that

$$\int_x^\infty f(y)\,dy \geq \frac{f(x)}{g(x)} \int_x^\infty g(y)\,dy$$

or

$$\lambda_X(x) \leq \lambda_Y(x)$$
To prove the final implication, note first that

$$\int_0^s \lambda_X(t)\,dt = \int_0^s \frac{f(t)}{\bar{F}(t)}\,dt = -\log \bar{F}(s)$$

or

$$\bar{F}(s) = e^{-\int_0^s \lambda_X(t)\,dt}$$
which immediately shows that $X \geq_{hr} Y \implies X \geq_{st} Y$. ∎
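
A quick numerical sanity check of Proposition 4.20 is possible on a concrete pair. The sketch below is an illustration added here, not an example from the text: it takes $X$ and $Y$ exponential with rates chosen by the caller and tests the three orderings on a finite grid of points, which can suggest but not prove an ordering.

```python
import math

def exp_density(rate, x):
    return rate * math.exp(-rate * x)

def exp_survival(rate, x):
    return math.exp(-rate * x)

def check_orderings(rate_x, rate_y, grid):
    """Return (lr, hr, st) flags for X >= Y, checked only at the grid points."""
    ratios = [exp_density(rate_x, x) / exp_density(rate_y, x) for x in grid]
    lr = all(a <= b for a, b in zip(ratios, ratios[1:]))   # f/g nondecreasing
    hr = all(
        exp_density(rate_x, x) / exp_survival(rate_x, x)
        <= exp_density(rate_y, x) / exp_survival(rate_y, x)
        for x in grid
    )                                                      # hazard of X <= hazard of Y
    st = all(exp_survival(rate_x, x) >= exp_survival(rate_y, x) for x in grid)
    return lr, hr, st

if __name__ == "__main__":
    grid = [0.1 * k for k in range(50)]
    print(check_orderings(1.0, 2.0, grid))   # rate 1 versus rate 2
```

For rates 1 and 2 the ratio $f(x)/g(x) = e^{x}/2$ is increasing, so all three flags should come out true, in line with the chain of implications.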


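Define $r_X(t)$, the reversed hazard rate function of $X$, by

$$r_X(t) = f(t)/F(t)$$
and note that
$$r_X(t) = \lim_{\epsilon \downarrow 0} \frac{P(t - \epsilon < X \mid X < t)}{\epsilon}$$
Say that $X$ is reverse hazard rate order larger than $Y$, written
$X \geq_{rh} Y$, if
$$r_X(t) \geq r_Y(t) \quad \text{for all } t$$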


Our next theorem gives some representations of the orderings of 
X and Y in terms of stochastically larger relations between certain 
conditional distributions of X and Y. Before presenting it, we intro- 
duce the following notation. 


Notation For a random variable X and event A, let [X|A] de- 
note a random variable whose distribution is that of the conditional 
distribution of X given A. 


Theorem 4.21

(i) $X \geq_{hr} Y \iff [X \mid X > t] \geq_{st} [Y \mid Y > t]$ for all $t$
(ii) $X \geq_{rh} Y \iff [X \mid X < t] \geq_{st} [Y \mid Y < t]$ for all $t$
(iii) $X \geq_{lr} Y \iff [X \mid s < X < t] \geq_{st} [Y \mid s < Y < t]$ for all $s, t$


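Proof. To prove (i), let $X_t =_d [X - t \mid X > t]$ and $Y_t =_d [Y - t \mid Y > t]$.
Noting that

$$\lambda_{X_t}(s) = \begin{cases} 0, & \text{if } s < 0 \\ \lambda_X(s + t), & \text{if } s \geq 0 \end{cases}$$

shows that
$$X \geq_{hr} Y \implies X_t \geq_{hr} Y_t \implies X_t \geq_{st} Y_t \implies [X \mid X > t] \geq_{st} [Y \mid Y > t]$$
To go the other way, use the identity

$$\bar{F}_{X_t}(\epsilon) = e^{-\int_t^{t+\epsilon} \lambda_X(s)\,ds}$$
to obtain that
$$\begin{aligned}
[X \mid X > t] \geq_{st} [Y \mid Y > t] &\iff X_t \geq_{st} Y_t \\
&\implies \lambda_X(t) \leq \lambda_Y(t)
\end{aligned}$$

(ii) With $X_t =_d [t - X \mid X < t]$,

$$\lambda_{X_t}(y) = r_X(t - y), \quad 0 < y < t$$

Therefore, with $Y_t =_d [t - Y \mid Y < t]$

$$\begin{aligned}
X \geq_{rh} Y &\iff \lambda_{X_t}(y) \geq \lambda_{Y_t}(y) \\
&\implies X_t \leq_{st} Y_t \\
&\iff [t - X \mid X < t] \leq_{st} [t - Y \mid Y < t] \\
&\iff [X \mid X < t] \geq_{st} [Y \mid Y < t]
\end{aligned}$$

On the other hand
$$\begin{aligned}
[X \mid X < t] \geq_{st} [Y \mid Y < t] &\iff X_t \leq_{st} Y_t \\
&\implies \int_0^\epsilon \lambda_{X_t}(y)\,dy \geq \int_0^\epsilon \lambda_{Y_t}(y)\,dy \\
&\iff \int_0^\epsilon r_X(t - y)\,dy \geq \int_0^\epsilon r_Y(t - y)\,dy \\
&\implies r_X(t) \geq r_Y(t)
\end{aligned}$$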
(iii) Let $X$ and $Y$ have respective densities $f$ and $g$. Suppose
$[X \mid s < X < t] \geq_{st} [Y \mid s < Y < t]$ for all $s < t$. Letting $s < v < t$,
this implies that

$$P(X > v \mid s < X < t) \geq P(Y > v \mid s < Y < t)$$
or, equivalently, that

$$\frac{P(v < X < t)}{P(s < X < t)} \geq \frac{P(v < Y < t)}{P(s < Y < t)}$$

or, upon inverting, that

$$\frac{P(s < X \leq v)}{P(v < X < t)} \leq \frac{P(s < Y \leq v)}{P(v < Y < t)}$$
or, equivalently, that
$$\frac{P(s < X \leq v)}{P(s < Y \leq v)} \leq \frac{P(v < X < t)}{P(v < Y < t)} \tag{4.11}$$
Letting $v \downarrow s$ in Equation (4.11) yields
$$\frac{f(s)}{g(s)} \leq \frac{P(s < X < t)}{P(s < Y < t)}$$
whereas, letting $v \uparrow t$ in Equation (4.11) yields
$$\frac{P(s < X < t)}{P(s < Y < t)} \leq \frac{f(t)}{g(t)}$$

Thus, $\frac{f(t)}{g(t)} \geq \frac{f(s)}{g(s)}$, showing that $X \geq_{lr} Y$.

Now suppose that $X \geq_{lr} Y$. Then clearly $[X \mid s < X < t] \geq_{lr}
[Y \mid s < Y < t]$, implying, from Proposition 4.20, that $[X \mid s < X <
t] \geq_{st} [Y \mid s < Y < t]$. ∎


Corollary 4.22 $X \geq_{lr} Y \implies X \geq_{rh} Y \implies X \geq_{st} Y$


Proof. The first implication immediately follows from parts (ii) and
(iii) of Theorem 4.21. The second implication follows upon taking
the limit as $t \to \infty$ in part (ii) of that theorem. ∎


Say that $X$ is an increasing hazard rate (or IHR) random variable
if $\lambda_X(t)$ is nondecreasing in $t$. (Other terminology is to say that $X$
has an increasing failure rate.)


Proposition 4.23 Let $X_t =_d [X - t \mid X > t]$. Then,
(a) $X_t \downarrow_{st}$ as $t \uparrow \iff X$ is IHR
(b) $X_t \downarrow_{lr}$ as $t \uparrow \iff \log f(x)$ is concave


Proof. (a) Let $\lambda(y)$ be the hazard rate function of $X$. Then $\lambda_t(y)$,
the hazard rate function of $X_t$, is given by

$$\lambda_t(y) = \lambda(t + y), \quad y > 0$$

Hence, if $X$ is IHR then $X_t \downarrow_{hr} t$, implying that $X_t \downarrow_{st} t$. Now, let
$s < t$, and suppose that $X_s \geq_{st} X_t$. Then,

$$e^{-\int_t^{t+\epsilon} \lambda(y)\,dy} = P(X_t > \epsilon) \leq P(X_s > \epsilon) = e^{-\int_s^{s+\epsilon} \lambda(y)\,dy}$$
showing that $\lambda(t) \geq \lambda(s)$. Thus part (a) is proved.
(b) Using that the density of $X_t$ is $f_{X_t}(x) = f(x + t)/\bar{F}(t)$ yields,
for $s < t$, that

$$\begin{aligned}
X_s \geq_{lr} X_t &\iff \frac{f(x + s)}{f(x + t)} \uparrow x \\
&\iff \log f(x + s) - \log f(x + t) \uparrow x \\
&\iff \frac{f'(x + s)}{f(x + s)} - \frac{f'(x + t)}{f(x + t)} \geq 0 \\
&\iff \frac{d}{dy} \log f(y) \downarrow y \\
&\iff \log f(y) \text{ is concave}
\end{aligned}$$ ∎


4.8 Exercises


1. For a nonnegative random variable $X$, show that $(E[X^n])^{1/n}$
is nondecreasing in n.

2. Let $X$ be a standard normal random variable. Use Corollary
4.4, along with the density

$$g(x) = x e^{-x^2/2}, \quad x > 0$$

to show, for $c > 0$, that
(a) $P(X > c) = \frac{1}{\sqrt{2\pi}} e^{-c^2/2} E_g[\frac{1}{X} \mid X > c]$
(b) Show, for any positive random variable $W$, that
$$E[\tfrac{1}{W} \mid W > c] \leq E[\tfrac{1}{W}]$$
(c) Show that
$$P(X > c) \leq \frac{1}{2} e^{-c^2/2}$$
(d) Show that
$$E_g[X \mid X > c] = c + e^{c^2/2} \sqrt{2\pi} P(X > c)$$
(e) Use Jensen's inequality, along with the preceding, to show
that
$$c P(X > c) + e^{c^2/2} \sqrt{2\pi} (P(X > c))^2 \geq \frac{1}{\sqrt{2\pi}} e^{-c^2/2}$$
(f) Argue from (e) that $P(X > c)$ must be at least as large
as the positive root of the equation
$$c x + e^{c^2/2} \sqrt{2\pi} x^2 = \frac{1}{\sqrt{2\pi}} e^{-c^2/2}$$
(g) Conclude that
$$P(X > c) \geq \frac{1}{2\sqrt{2\pi}} \left(\sqrt{c^2 + 4} - c\right) e^{-c^2/2}$$

3. Let $X$ be a Poisson random variable with mean $\lambda$. Show that,
for $n \geq \lambda$, the Chernoff bound yields that

$$P(X \geq n) \leq \frac{e^{-\lambda} (\lambda e)^n}{n^n}$$

4. Let $m(t) = E[X^t]$. The moment bound states that for $c > 0$

$$P(X \geq c) \leq m(t) c^{-t}$$

for all $t > 0$. Show that this result can be obtained from the
importance sampling identity.

5. Fill in the details of the proof that, for independent Bernoulli
random variables $X_1, \ldots, X_n$, and $c > 0$,
$$P(S - E[S] \leq -c) \leq e^{-2c^2/n}$$
where $S = \sum_{i=1}^{n} X_i$.

6. If $X$ is a binomial random variable with parameters n and p,
show
(a) $P(|X - np| > c) \leq 2e^{-2c^2/n}$
(b) $P(X - np \geq \alpha n p) \leq \exp\{-2np^2\alpha^2\}$

7. Give the details of the proof of (4.7).
8. Prove that

$$E[f(X)] \geq E[f(E[X \mid Y])] \geq f(E[X])$$

Suppose you want a lower bound on $E[f(X)]$ for a convex
function $f$. The preceding shows that first conditioning on $Y$
and then applying Jensen's inequality to the individual terms
$E[f(X) \mid Y = y]$ results in a larger lower bound than does an
immediate application of Jensen's inequality.

9. Let $X_i$ be binary random variables with parameters $p_i$, $i =
1, \ldots, n$. Let $X = \sum_{i=1}^{n} X_i$, and also let $I$, independent of the
variables $X_1, \ldots, X_n$, be equally likely to be any of the values
$1, \ldots, n$. For $R$ independent of $I$, show that
(a) $P(I = i \mid X_I = 1) = p_i/E[X]$
(b) $E[XR] = E[X] E[R \mid X_I = 1]$
(c) $P(X > 0) = E[X] E[\frac{1}{X} \mid X_I = 1]$

10. For $X_i$ and $X$ as in Exercise 9, show that

$$\sum_i \frac{p_i}{E[X \mid X_i = 1]} \geq \frac{(E[X])^2}{E[X^2]}$$

Thus, for sums of binary variables, the conditional expectation
inequality yields a stronger lower bound than does the second
moment inequality.
Hint: Make use of the results of Exercises 8 and 9.

11. Let $X_i$ be exponential with mean $8 + 2i$, for $i = 1, 2, 3$. Obtain
an upper bound on $E[\max X_i]$, and compare it with the exact
result when the $X_i$ are independent.

12. Let $U_i$, $i = 1, \ldots, n$ be uniform (0,1) random variables. Obtain
an upper bound on $E[\max U_i]$, and compare it with the exact
result when the $U_i$ are independent.

13. Let $U_1$ and $U_2$ be uniform (0,1) random variables. Obtain an
upper bound on $E[\max(U_1, U_2)]$, and show this maximum is
obtained when $U_1 = 1 - U_2$.
14. Show that $X \geq_{hr} Y$ if and only if

$$\frac{P(X > t)}{P(X > s)} \geq \frac{P(Y > t)}{P(Y > s)}$$

for all $s < t$.

15. Let $h(x, y)$ be a real valued function satisfying
$$h(x, y) \geq h(y, x) \quad \text{whenever } x \geq y$$
(a) Show that if $X$ and $Y$ are independent and $X \geq_{lr} Y$ then
$h(X, Y) \geq_{st} h(Y, X)$.
(b) Show by a counterexample that the preceding is not valid
under the weaker condition $X \geq_{st} Y$.

16. There are n jobs, with job $i$ requiring a random time $X_i$ to
process. The jobs must be processed sequentially. Give a sufficient
condition, the weaker the better, under which the policy
of processing jobs in the order $1, 2, \ldots, n$ maximizes the probability
that at least k jobs are processed by time $t$ for all k and $t$.


Chapter 5 


Markov Chains 


5.1 Introduction 


This chapter introduces a natural generalization of a sequence of in- 
dependent random variables, called a Markov chain, where a variable 
may depend on the immediately preceding variable in the sequence. 
Named after the 19th century Russian mathematician Andrei An- 
dreyevich Markov, these chains are widely used as simple models of 
more complex real-world phenomena. 

Given a sequence of discrete random variables $X_0, X_1, X_2, \ldots$ taking
values in some finite or countably infinite set $S$, we say that $X_n$
is a Markov chain with respect to a filtration $\mathcal{F}_n$ if $X_n \in \mathcal{F}_n$ for all
n and, for all $B \subseteq S$, we have the Markov property

$$P(X_{n+1} \in B \mid \mathcal{F}_n) = P(X_{n+1} \in B \mid X_n).$$

If we interpret $X_n$ as the state of the chain at time n, then the
preceding means that if you know the current state, nothing else
from the past is relevant to the future of the Markov chain. That
is, given the present state, the future states and the past states are
independent. When we let $\mathcal{F}_n = \sigma(X_0, X_1, \ldots, X_n)$ this definition
reduces to

$$P(X_{n+1} = j \mid X_n = i, X_{n-1} = i_{n-1}, \ldots, X_0 = i_0) = P(X_{n+1} = j \mid X_n = i).$$

If $P(X_{n+1} = j \mid X_n = i)$ is the same for all n, we say that the Markov
chain has stationary transition probabilities, and we set
$$P_{ij} = P(X_{n+1} = j \mid X_n = i)$$


In this case the quantities $P_{ij}$ are called the transition probabilities,
and specifying them along with a probability distribution for the
starting state $X_0$ is enough to determine all probabilities concerning
$X_0, \ldots, X_n$. We will assume from here on that all Markov chains
considered have stationary transition probabilities. In addition, unless
otherwise noted, we will assume that $S$, the set of all possible
states of the Markov chain, is the set of nonnegative integers.


Example 5.1 Reflected random walk. Suppose $Y_i$ are independent
and identically distributed Bernoulli(p) random variables, and let
$X_0 = 0$ and $X_n = (X_{n-1} + 2Y_n - 1)^+$ for $n = 1, 2, \ldots$. The process
$X_n$, called a reflected random walk, can be viewed as the position
of a particle at time n such that at each time the particle has a p
probability of moving one step to the right, a $1 - p$ probability of
moving one step to the left, and is returned to position 0 if it ever
attempts to move to the left of 0. It is immediate from its definition
that $X_n$ is a Markov chain.
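
The recursion of Example 5.1 translates directly into a simulation. The following is a minimal sketch (the function name `reflected_walk` and the seeded generator are illustrative choices, not from the text).

```python
import random

def reflected_walk(p, steps, seed=0):
    """Simulate X_n = (X_{n-1} + 2*Y_n - 1)^+ with Y_n ~ Bernoulli(p), X_0 = 0."""
    rng = random.Random(seed)            # seeded so that runs are reproducible
    x, path = 0, [0]
    for _ in range(steps):
        y = 1 if rng.random() < p else 0
        x = max(x + 2 * y - 1, 0)        # positive part keeps the walk at or above 0
        path.append(x)
    return path

if __name__ == "__main__":
    print(reflected_walk(0.4, 20))
```

Each new position depends only on the current position and a fresh $Y_n$, which is the Markov property in this example.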


Example 5.2 A non-Markov chain. Again let $Y_i$ be independent and
identically distributed Bernoulli(p) random variables, let $X_0 = 0$, and
this time let $X_n = Y_n + Y_{n-1}$ for $n = 1, 2, \ldots$. It's easy to see that $X_n$
is not a Markov chain because $P(X_{n+1} = 2 \mid X_n = 1, X_{n-1} = 2) = 0$
while on the other hand $P(X_{n+1} = 2 \mid X_n = 1, X_{n-1} = 0) = p$.

5.2 The Transition Matrix 

The transition probabilities

$$P_{ij} = P(X_1 = j \mid X_0 = i),$$

are also called the one-step transition probabilities. We define the
n-step transition probabilities by

$$P_{ij}^{(n)} = P(X_n = j \mid X_0 = i).$$
In addition, we define the transition probability matrix

$$P = \begin{pmatrix} P_{00} & P_{01} & P_{02} & \cdots \\ P_{10} & P_{11} & P_{12} & \cdots \\ \vdots & \vdots & \vdots & \end{pmatrix}$$

and the n-step transition probability matrix

$$P^{(n)} = \begin{pmatrix} P_{00}^{(n)} & P_{01}^{(n)} & P_{02}^{(n)} & \cdots \\ P_{10}^{(n)} & P_{11}^{(n)} & P_{12}^{(n)} & \cdots \\ \vdots & \vdots & \vdots & \end{pmatrix}$$

An interesting relation between these matrices is obtained by
noting that

$$\begin{aligned}
P_{ij}^{(n+m)} &= \sum_k P(X_{n+m} = j \mid X_0 = i, X_n = k) P(X_n = k \mid X_0 = i) \\
&= \sum_k P_{ik}^{(n)} P_{kj}^{(m)}.
\end{aligned}$$
The preceding are called the Chapman-Kolmogorov equations.
It follows from the Chapman-Kolmogorov equations that
$$P^{(n+m)} = P^{(n)} \times P^{(m)},$$
where here "$\times$" represents matrix multiplication. Hence,
$$P^{(2)} = P \times P,$$

and, by induction,
$$P^{(n)} = P^n,$$

where the right-hand side represents multiplying the matrix $P$ by
itself n times.


Example 5.3 Reflected random walk. A particle starts at position
zero and at each time moves one position to the right with probability
p and otherwise, with probability $1 - p$, moves one position to the
left if it is not in position zero (or remains in state 0 if it is). The position
$X_n$ of the particle at time n forms a Markov chain with transition
matrix

$$P = \begin{pmatrix} 1-p & p & 0 & 0 & 0 & \cdots \\ 1-p & 0 & p & 0 & 0 & \cdots \\ 0 & 1-p & 0 & p & 0 & \cdots \\ \vdots & & & & & \end{pmatrix}$$

Example 5.4 The two state Markov chain. Consider a Markov chain
with states 0 and 1 having transition probability matrix

$$P = \begin{pmatrix} \alpha & 1 - \alpha \\ \beta & 1 - \beta \end{pmatrix}$$
The two-step transition probability matrix is given by

$$P^{(2)} = \begin{pmatrix} \alpha^2 + (1-\alpha)\beta & 1 - \alpha^2 - (1-\alpha)\beta \\ \alpha\beta + \beta(1-\beta) & 1 - \alpha\beta - \beta(1-\beta) \end{pmatrix}$$
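
The relation $P^{(n)} = P^n$ and the two-step matrix of Example 5.4 are easy to check by direct multiplication. The sketch below uses plain nested lists; the helper names `mat_mul`, `mat_pow`, and `two_state` are illustrative and not from the text.

```python
def mat_mul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def mat_pow(p, n):
    """n-step transition matrix P^(n), computed as P multiplied by itself n times."""
    size = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, p)
    return result

def two_state(alpha, beta):
    """Transition matrix of the two state chain of Example 5.4."""
    return [[alpha, 1.0 - alpha], [beta, 1.0 - beta]]

if __name__ == "__main__":
    P = two_state(0.7, 0.4)
    print(mat_pow(P, 2))                              # compare with the closed form
    print(mat_pow(P, 5))
    print(mat_mul(mat_pow(P, 2), mat_pow(P, 3)))      # Chapman-Kolmogorov: same matrix
```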


5.3. The Strong Markov Property 


Consider a Markov chain $X_n$ having one-step transition probabilities
$P_{ij}$, which means that if the Markov chain is in state $i$ at a fixed
time n, then the next state will be $j$ with probability $P_{ij}$. However,
it is not necessarily true that if the Markov chain is in state $i$ at a
randomly distributed time $T$, the next state will be $j$ with probability
$P_{ij}$. That is, if $T$ is an arbitrary nonnegative integer valued random
variable, it is not necessarily true that $P(X_{T+1} = j \mid X_T = i) = P_{ij}$.
For a simple counterexample suppose

$$T = \min(n : X_n = i, X_{n+1} = j)$$

Then, clearly,

$$P(X_{T+1} = j \mid X_T = i) = 1.$$
The idea behind this counterexample is that a general random variable
$T$ may depend not only on the states of the Markov chain up
to time $T$ but also on future states after time $T$. Recalling that $T$ is
a stopping time for a filtration $\mathcal{F}_n$ if $\{T = n\} \in \mathcal{F}_n$ for every n, we
see that for a stopping time the value of $T$ can only depend on the
states up to time $T$. We now show that $P(X_{T+n} = j \mid X_T = i)$ will
equal $P_{ij}^{(n)}$ provided that $T$ is a stopping time.
This is usually called the strong Markov property, and essentially
means a Markov chain "starts over" at stopping times. Below we
define $\mathcal{F}_T = \{A : A \cap \{T = t\} \in \mathcal{F}_t \text{ for all } t\}$, which intuitively
represents any information you would know by time $T$.


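Proposition 5.5 The strong Markov property. Let $X_n$, $n \geq 0$, be a
Markov chain with respect to the filtration $\mathcal{F}_n$. If $T < \infty$ a.s. is a
stopping time with respect to $\mathcal{F}_n$, then

$$P(X_{T+n} = j \mid X_T = i, \mathcal{F}_T) = P_{ij}^{(n)}$$


Proof.
$$\begin{aligned}
P(X_{T+n} = j \mid X_T = i, \mathcal{F}_T, T = t) &= P(X_{t+n} = j \mid X_t = i, \mathcal{F}_t, T = t) \\
&= P(X_{t+n} = j \mid X_t = i, \mathcal{F}_t) \\
&= P_{ij}^{(n)}
\end{aligned}$$

where the next to last equality used the fact that $T$ is a stopping
time to give $\{T = t\} \in \mathcal{F}_t$. ∎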


Example 5.6 Losses in queueing busy periods. Consider a queueing
system where $X_n$ is the number of customers in the system at
time n. At each time $n = 1, 2, \ldots$ there is a probability p that a
customer arrives and a probability $(1 - p)$, if there are any customers
present, that one of them departs. Starting with $X_0 = 1$,
let $T = \min\{t > 0 : X_t = 0\}$ be the length of a busy period. Suppose
also there is only space for at most m customers in the system.
Whenever a customer arrives to find m customers already in the system,
the customer is lost and departs immediately. Letting $N_m$ be
the number of customers lost during a busy period, compute $E[N_m]$.


Solution. Let $A$ be the event that the first arrival occurs before the
first departure. We will obtain $E[N_m]$ by conditioning on whether $A$
occurs. Now, when $A$ happens, for the busy period to end we must
first wait an interval of time until the system goes back to having a
single customer, and then after that we must wait another interval
of time until the system becomes completely empty. By the Markov
property the number of losses during the first time interval has
distribution $N_{m-1}$, because we are now starting with two customers and
therefore with only $m - 1$ spaces for additional customers. The strong
Markov property tells us that the number of losses in the second time
interval has distribution $N_m$. We therefore have

$$E[N_m \mid A] = \begin{cases} E[N_{m-1}] + E[N_m] & \text{if } m > 1 \\ 1 + E[N_m] & \text{if } m = 1 \end{cases}$$

and using $P(A) = p$ and $E[N_m \mid A^c] = 0$ we have

$$\begin{aligned}
E[N_m] &= E[N_m \mid A] P(A) + E[N_m \mid A^c] P(A^c) \\
&= p E[N_{m-1}] + p E[N_m]
\end{aligned}$$

for $m > 1$ along with
$$E[N_1] = p + p E[N_1]$$
and thus

$$E[N_m] = \left(\frac{p}{1-p}\right)^m.$$

It's quite interesting to notice that $E[N_m]$ increases in m when
$p > 1/2$, decreases when $p < 1/2$, and stays constant for all m when
$p = 1/2$. The intuition for the case $p = 1/2$ is that when m increases,
losses become less frequent but the busy period becomes longer. ∎
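
The closed form $(p/(1-p))^m$ can be cross-checked by a different route than the one in the text. The sketch below uses first-step analysis on the number of customers present: writing $e_k$ for the expected number of losses before the system empties when k customers are present, one has $e_0 = 0$, $e_k = p e_{k+1} + (1-p) e_{k-1}$ for $0 < k < m$, and $e_m = p(1 + e_m) + (1-p) e_{m-1}$. This formulation and the function name are assumptions made for the illustration; since the equations are linear, each $e_k$ is a multiple $a_k$ of $e_1$ and the boundary equation fixes $e_1 = E[N_m]$.

```python
def expected_losses(p, m):
    """E[N_m] via first-step analysis, for 0 < p < 1 and capacity m >= 1."""
    # Write e_k = a_k * e_1 with a_0 = 0, a_1 = 1; the interior equations give
    # a_{k+1} = (a_k - (1 - p) * a_{k-1}) / p.
    a_prev, a_curr = 0.0, 1.0
    for _ in range(m - 1):
        a_prev, a_curr = a_curr, (a_curr - (1.0 - p) * a_prev) / p
    # Boundary equation at k = m: (1 - p) * (e_m - e_{m-1}) = p.
    return p / ((1.0 - p) * (a_curr - a_prev))

if __name__ == "__main__":
    for p in (0.3, 0.5, 0.6):
        print(p, [round(expected_losses(p, m), 4) for m in range(1, 6)])
```

The printed rows should decrease in m for p below 1/2, stay at 1 for p = 1/2, and increase for p above 1/2, matching the remark that closes the example.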


We next apply the strong Markov property to obtain a result for 
the cover time, the time when all states of a Markov chain have been 
visited. 


Proposition 5.7 Cover times. Given an N-state Markov chain $X_n$,
let $T_i = \min\{n > 0 : X_n = i\}$ and let $C = \max_i T_i$ be the cover time.
Then $E[C] \leq \sum_{m=1}^{N} \frac{1}{m} \max_{i,j} E[T_j \mid X_0 = i]$.


Proof. Let $I_1, I_2, \ldots, I_N$ be a random permutation of the integers
$1, 2, \ldots, N$ chosen so that all possible orderings are equally likely.
Letting $T_{I_0} = 0$ and noting that $\max_{j \leq m} T_{I_j} - \max_{j \leq m-1} T_{I_j}$ is
the additional time after all states $I_1, \ldots, I_{m-1}$ have been visited
until all states $I_1, \ldots, I_m$ have been visited, we see that

$$C = \sum_{m=1}^{N} \left(\max_{j \leq m} T_{I_j} - \max_{j \leq m-1} T_{I_j}\right).$$
Thus, we have
$$\begin{aligned}
E[C] &= \sum_{m=1}^{N} E\left[\max_{j \leq m} T_{I_j} - \max_{j \leq m-1} T_{I_j}\right] \\
&= \sum_{m=1}^{N} \frac{1}{m} E\left[\max_{j \leq m} T_{I_j} - \max_{j \leq m-1} T_{I_j} \,\Big|\, T_{I_m} > \max_{j \leq m-1} T_{I_j}\right] \\
&\leq \sum_{m=1}^{N} \frac{1}{m} \max_{i,j} E[T_j \mid X_0 = i],
\end{aligned}$$

where the second line follows since all orderings are equally likely
and thus $P(T_{I_m} > \max_{j \leq m-1} T_{I_j}) = 1/m$, and the third from the
strong Markov property. ∎


5.4 Classification of States 


We say that states $i, j$ of a Markov chain communicate with each
other, or are in the same class, if there are integers n and m such
that both $P_{ij}^{(n)} > 0$ and $P_{ji}^{(m)} > 0$ hold. This means that it is possible
for the chain to get from $i$ to $j$ and vice versa. A Markov chain is
called irreducible if all states are in the same class.

For a Markov chain $X_n$, let

$$T_i = \min\{n > 0 : X_n = i\}$$
be the time until the Markov chain first makes a transition into state
$i$. Using the notation $E_i[\cdots]$ and $P_i(\cdots)$ to denote that the Markov
chain starts from state $i$, let

$$f_i = P_i(T_i < \infty)$$
be the probability that the chain ever makes a transition into state
$i$ given that it starts in state $i$. We say that state $i$ is transient if
$f_i < 1$ and recurrent if $f_i = 1$. Let

$$N_i = \sum_{n=1}^{\infty} I_{\{X_n = i\}}$$
be the total number of transitions into state $i$. The strong Markov
property tells us that, starting in state $i$,

$$N_i + 1 \sim \text{geometric}(1 - f_i),$$

since each time the chain makes a transition into state $i$ there is,
independent of all else, a $(1 - f_i)$ chance it will never return.
Consequently,

$$E_i[N_i] = \sum_{n=1}^{\infty} P_{ii}^{(n)}$$

is either infinite or finite depending on whether state $i$ is
recurrent or transient.


Proposition 5.8 If state $i$ is recurrent and $i$ communicates with $j$,
then $j$ is also recurrent.


Proof. Because $i$ and $j$ communicate, there exist values n and m
such that $P_{ij}^{(n)} P_{ji}^{(m)} > 0$. But for any $k > 0$

$$P_{jj}^{(n+m+k)} \geq P_{ji}^{(m)} P_{ii}^{(k)} P_{ij}^{(n)}$$

where the preceding follows because $P_{jj}^{(n+m+k)}$ is the probability
starting in state $j$ that the chain will be back in $j$ after $n + m + k$ transitions,
whereas $P_{ji}^{(m)} P_{ii}^{(k)} P_{ij}^{(n)}$ is the probability of the same event
occurring but with the additional condition that the chain must also be
in $i$ after the first m transitions and then back in $i$ after an additional
k transitions. Summing over k shows that

$$E_j[N_j] \geq \sum_k P_{jj}^{(n+m+k)} \geq P_{ji}^{(m)} P_{ij}^{(n)} \sum_k P_{ii}^{(k)} = \infty$$
Thus, $j$ is also recurrent. ∎


Proposition 5.9 If $j$ is transient, then $\sum_{n=1}^{\infty} P_{ij}^{(n)} < \infty$


Proof. Note that

$$E_i[N_j] = E_i\left[\sum_{n=1}^{\infty} I_{\{X_n = j\}}\right] = \sum_{n=1}^{\infty} P_{ij}^{(n)}$$
Let $f_{ij}$ denote the probability that the chain ever makes a transition
into $j$ given that it starts at $i$. Then, conditioning on whether such a
transition ever occurs yields, upon using the strong Markov property,
that

$$E_i[N_j] = (1 + E_j[N_j]) f_{ij} < \infty$$

since $j$ is transient. ∎


If $i$ is recurrent, let
$$\mu_i = E_i[T_i]$$

denote the mean number of transitions it takes the chain to return to
state $i$, given it starts in $i$. We say that a recurrent state $i$ is null if
$\mu_i = \infty$ and positive if $\mu_i < \infty$. In the next section we will show that
positive recurrence is a class property, meaning that if $i$ is positive
recurrent and communicates with $j$ then $j$ is also positive recurrent.
(This also implies, using that recurrence is a class property, that so
is null recurrence.)


5.5 Stationary and Limiting Distributions 


For a Markov chain $X_n$ starting in some given state $i$, we define the
limiting probability of being in state $j$ to be

$$P_j = \lim_{n \to \infty} P_{ij}^{(n)}$$
if the limit exists and is the same for all $i$.


It is easy to see that not all Markov chains will have limiting
probabilities. For instance, consider the two state Markov chain with
$P_{01} = P_{10} = 1$. For this chain $P_{00}^{(n)}$ will equal 1 when n is even and 0
when n is odd, and so has no limit.


Definition 5.10 State $i$ of a Markov chain $X_n$ is said to have period
$d$ if $d$ is the largest integer having the property that $P_{ii}^{(n)} = 0$ when n
is not a multiple of $d$.


Proposition 5.11 If states $i$ and $j$ communicate, then they have the
same period.
Proof. Let $d_k$ be the period of state $k$. Let n, m be such that
$P_{ij}^{(n)} P_{ji}^{(m)} > 0$. Now, if $P_{ii}^{(r)} > 0$, then
$$P_{jj}^{(r+n+m)} \geq P_{ji}^{(m)} P_{ii}^{(r)} P_{ij}^{(n)} > 0$$
So $d_j$ divides $r + n + m$. Moreover, because
$$P_{ii}^{(2r)} \geq P_{ii}^{(r)} P_{ii}^{(r)} > 0$$

the same argument shows that $d_j$ also divides $2r + n + m$; therefore
$d_j$ divides $2r + n + m - (r + n + m) = r$. Because $d_j$ divides $r$ whenever
$P_{ii}^{(r)} > 0$, it follows that $d_j$ divides $d_i$. But the same argument can
now be used to show that $d_i$ divides $d_j$. Hence, $d_i = d_j$. ∎


It follows from the preceding that all states of an irreducible 
Markov chain have the same period. If the period is 1, we say that 
the chain is aperiodic. It’s easy to see that only aperiodic chains can 
have limiting probabilities. 
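
Definition 5.10 can be explored on small finite chains: the period of a state is the greatest common divisor of the step counts at which a return has positive probability. The sketch below tracks only which states are reachable in exactly k steps. The function name and the default horizon of 3 times the number of states are choices made here, not taken from the text; for an irreducible chain that horizon is enough to pin down the gcd, but treat it as an assumption for anything else.

```python
from math import gcd

def period(P, state, max_steps=None):
    """gcd of all k <= max_steps with a positive k-step return probability."""
    n = len(P)
    if max_steps is None:
        max_steps = 3 * n                  # assumed horizon, see the note above
    current = {state}                      # states reachable in exactly k steps
    d = 0                                  # gcd(0, k) == k, so 0 is a neutral start
    for k in range(1, max_steps + 1):
        current = {j for i in current for j in range(n) if P[i][j] > 0}
        if state in current:
            d = gcd(d, k)
    return d                               # 0 means no return was seen

if __name__ == "__main__":
    flip = [[0.0, 1.0], [1.0, 0.0]]        # the two state chain with P01 = P10 = 1
    lazy = [[0.5, 0.5], [1.0, 0.0]]
    print(period(flip, 0), period(lazy, 0))
```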


Intimately linked to limiting probabilities are stationary probabilities.
The probability vector $\pi_i$, $i \in S$ is said to be a stationary
probability vector for the Markov chain if

$$\begin{aligned}
\pi_j &= \sum_i \pi_i P_{ij} \quad \text{for all } j \\
\sum_j \pi_j &= 1
\end{aligned}$$

Its name arises from the fact that if $X_0$ is distributed according
to a stationary probability vector $\{\pi_i\}$ then

$$P(X_1 = j) = \sum_i P(X_1 = j \mid X_0 = i)\pi_i = \sum_i \pi_i P_{ij} = \pi_j$$

and, by a simple induction argument,
$$P(X_n = j) = \sum_i P(X_n = j \mid X_{n-1} = i) P(X_{n-1} = i) = \sum_i \pi_i P_{ij} = \pi_j$$

Consequently, if we start the chain with a stationary probability
vector then $X_n, X_{n+1}, \ldots$ has the same probability distribution for
all n.


The following result will be needed later. 
Proposition 5.12 An irreducible transient Markov chain does not
have a stationary probability vector.


Proof. Assume there is a stationary probability vector $\pi_i$, $i \geq 0$ and
take it to be the probability mass function of $X_0$. Then, for any $j$

$$\pi_j = P(X_n = j) = \sum_i \pi_i P_{ij}^{(n)}$$
Consequently, for any m

$$\begin{aligned}
\pi_j &= \lim_{n \to \infty} \sum_i \pi_i P_{ij}^{(n)} \\
&\leq \lim_{n \to \infty} \left(\sum_{i \leq m} \pi_i P_{ij}^{(n)} + \sum_{i > m} \pi_i\right) \\
&= \sum_{i \leq m} \pi_i \lim_{n \to \infty} P_{ij}^{(n)} + \sum_{i > m} \pi_i \\
&= \sum_{i > m} \pi_i
\end{aligned}$$

where the final equality used that $\sum_n P_{ij}^{(n)} < \infty$ because $j$ is transient,
implying that $\lim_{n \to \infty} P_{ij}^{(n)} = 0$. Letting $m \to \infty$ shows that
$\pi_j = 0$ for all $j$, contradicting the fact that $\sum_j \pi_j = 1$. Thus, assuming
a stationary probability vector results in a contradiction, proving
the result. ∎


The following theorem is of key importance. 


Theorem 5.13 An irreducible Markov chain has a stationary probability
vector $\{\pi_i\}$ if and only if all states are positive recurrent. The
stationary probability vector is unique and satisfies

$$\pi_j = 1/\mu_j$$

Moreover, if the chain is aperiodic then
$$\pi_j = \lim_{n \to \infty} P_{ij}^{(n)}$$
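
Theorem 5.13 can be checked by hand on the two state chain of Example 5.4, where the mean return times have a simple form: from state 0 the chain either returns at once (probability $\alpha$) or moves to state 1 and then needs a geometric number of steps with success probability $\beta$ to come back, so $\mu_0 = 1 + (1-\alpha)/\beta$, and similarly $\mu_1 = 1 + \beta/(1-\alpha)$. The sketch below, with illustrative function names and assuming $0 < \alpha < 1$ and $0 < \beta < 1$, approximates the stationary vector by repeatedly applying $P$ to a starting distribution and compares it with $1/\mu_j$.

```python
def step(dist, P):
    """One transition: the row vector dist multiplied by the matrix P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

def stationary_by_iteration(P, steps=1000):
    """Approximate the limiting (hence stationary) vector of an aperiodic chain."""
    dist = [1.0 / len(P)] * len(P)         # any starting distribution would do
    for _ in range(steps):
        dist = step(dist, P)
    return dist

def two_state_mean_return_times(alpha, beta):
    """(mu_0, mu_1) for the two state chain, assuming 0 < alpha < 1, 0 < beta < 1."""
    return 1.0 + (1.0 - alpha) / beta, 1.0 + beta / (1.0 - alpha)

if __name__ == "__main__":
    alpha, beta = 0.7, 0.4
    P = [[alpha, 1.0 - alpha], [beta, 1.0 - beta]]
    pi = stationary_by_iteration(P)
    mu0, mu1 = two_state_mean_return_times(alpha, beta)
    print(pi, [1.0 / mu0, 1.0 / mu1])
```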


To prove the preceding theorem, we will make use of a couple of 
lemmas. 
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Lemma 5.14 For an irreducible Markov chain, if there exists a sta- 
tionary probability vector {7;} then all states are positive recurrent. 
Moreover, the stationary probability vector is unique and satisfies 


75 = 1/p; 
Proof. Let 2; be stationary probabilities, and suppose that P(Xo = 
j) = 7; for all j. We first show that 7; > 0 for all i. To verify this, 
suppose that 7, = 0. Now for any state j, because the chain is irre- 


ducible, there is an n such that Pe > 0. Because Xo is determined 
by the stationary probabilities, 


t= P(X, =k) =) mP > a, PO 
a 


Consequently, if 7, = 0 then so is 7;. Because j was arbitrary, that 
means that it 7; = 0 for any i, then 7; = 0 for all 7. But that would 
contradict the fact that 5°; 7; = 1. Hence any stationary probabil- 
ity vector for an irreducible Markov chain must have all positive 
elements. 

Now, recall that T; = min(n > 0: Xn = 7). So, 


py = E[T;|Xo = J] 
fo @) 
= S > P(T; > n|Xo = 5) 
n=l 
_ * P(T;j 2 0, Xo = 3) 
n=1 P(Xo = j) 


Because Xo is chosen according to the stationary probability vector 
{7;}, this gives 


oe) 
mH; = >, P(T; >, Xo = 3) (5.1) 
n=1 
Now, 
P(T; > 1,X0 = j) = P(Xo= 9) =; 
and, for n > 2 
P(T; 2 n, Xo = J) 

P(X; #j,1<i<n—-1,X9 = 7) 
P(X; #j,1 SiS n—1)— P(X, # 5,0 <Sisn-}) 
P(X; #j,1 Si <n—-1)—P(XiF5,1<i<n) 


IV 


I 
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where the final equality used that Xo,..., Xn— has the same proba- 
bility distribution as X1,...,Xn, when Xo is chosen according to the 
stationary probabilities. Substituting these results into (5.1) yields 


Tj Mg = 15 + P(X # 9) — lim P(Xi #5,1 St <7) 


But the existence of a stationary probability vector implies that the
Markov chain is recurrent, and thus that
$\lim_n P(X_i \ne j, 1 \le i \le n) = P(X_i \ne j \text{ for all } i \ge 1) = 0$.
Because $P(X_1 \ne j) = 1 - \pi_j$, we thus obtain

$$\pi_j = 1/\mu_j$$

thus showing that there is at most one stationary probability vector.

In addition, because all $\pi_j > 0$, we have that all $\mu_j < \infty$,
showing that all states of the chain are positive recurrent. ∎


Lemma 5.15 If some state of an irreducible Markov chain is positive 
recurrent then there exists a stationary probability vector. 


Proof. Suppose state $k$ is positive recurrent. Thus,

$$\mu_k = E_k[T_k] < \infty$$

Say that a new cycle begins every time the chain makes a transition
into state $k$. For any state $j$, let $A_j$ denote the amount of time
the chain spends in state $j$ during a cycle. Then

$$\begin{aligned}
E[A_j] &= E_k\Big[\sum_{n=0}^{\infty} I\{X_n = j, T_k > n\}\Big] \\
&= \sum_{n=0}^{\infty} E_k\big[I\{X_n = j, T_k > n\}\big] \\
&= \sum_{n=0}^{\infty} P_k(X_n = j, T_k > n)
\end{aligned}$$


We claim that $\pi_j = E[A_j]/\mu_k$, $j \ge 0$, is a stationary
probability vector. Because $E[\sum_j A_j]$ is the expected time of a
cycle, it must equal $E_k[T_k]$, showing that

$$\sum_j \pi_j = 1$$

Moreover, for $j \ne k$,

$$\begin{aligned}
\mu_k \pi_j &= \sum_{n \ge 0} P_k(X_n = j, T_k > n) \\
&= \sum_{n \ge 1} \sum_i P_k(X_n = j, T_k > n-1, X_{n-1} = i) \\
&= \sum_{n \ge 1} \sum_i P_k(T_k > n-1, X_{n-1} = i)\,
   P_k(X_n = j \mid T_k > n-1, X_{n-1} = i) \\
&= \sum_i \sum_{n \ge 1} P_k(T_k > n-1, X_{n-1} = i)\, P_{ij} \\
&= \sum_i \sum_{n \ge 0} P_k(T_k > n, X_n = i)\, P_{ij} \\
&= \sum_i E[A_i] P_{ij} \\
&= \mu_k \sum_i \pi_i P_{ij}
\end{aligned}$$

Finally,

$$\sum_i \pi_i P_{ik} = \sum_i \pi_i \Big(1 - \sum_{j \ne k} P_{ij}\Big)
= 1 - \sum_{j \ne k} \pi_j = \pi_k$$

and the proof is complete. ∎


Note that Lemmas 5.14 and 5.15 imply the following.


Corollary 5.16 If one state of an irreducible Markov chain is positive
recurrent then all states are positive recurrent.


We are now ready to prove Theorem 5.13.


Proof. All that remains to be proven is that if the chain is
aperiodic, as well as irreducible and positive recurrent, then the
stationary probabilities are also limiting probabilities. To prove
this, let $\pi_i$, $i \ge 0$, be stationary probabilities. Let
$X_n, n \ge 0$ and $Y_n, n \ge 0$ be independent Markov chains, both
with transition probabilities $P_{ij}$, but with $X_0 = i$ and with
$P(Y_0 = i) = \pi_i$. Let

$$N = \min(n : X_n = Y_n)$$

We first show that $P(N < \infty) = 1$. To do so, consider the Markov
chain whose state at time $n$ is $(X_n, Y_n)$, and which thus has
transition probabilities

$$P_{(i,j),(k,r)} = P_{ik} P_{jr}$$

That this chain is irreducible can be seen by the following argument.
Because $\{X_n\}$ is irreducible and aperiodic, it follows that for any
state $i$ there are relatively prime integers $n, m$ such that
$P^{(n)}_{ii} P^{(m)}_{ii} > 0$. But any sufficiently large integer can
be expressed as a linear combination, with nonnegative integer
coefficients, of relatively prime integers, implying that there is an
integer $N_i$ such that

$$P^{(n)}_{ii} > 0 \quad \text{for all } n > N_i$$

Because $i$ and $j$ communicate this implies the existence of an
integer $N_{ij}$ such that

$$P^{(n)}_{ij} > 0 \quad \text{for all } n > N_{ij}$$

Hence,

$$P^{(n)}_{(i,j),(k,r)} = P^{(n)}_{ik} P^{(n)}_{jr} > 0
\quad \text{for all sufficiently large } n$$

which shows that the vector chain $(X_n, Y_n)$ is irreducible.


In addition, we claim that $\pi_{i,j} = \pi_i \pi_j$ is a stationary
probability vector, which is seen from

$$\pi_i \pi_j = \sum_k \pi_k P_{k,i} \sum_r \pi_r P_{r,j}
= \sum_{k,r} \pi_k \pi_r P_{k,i} P_{r,j}$$

By Lemma 5.14 this shows that the vector Markov chain is positive
recurrent and so $P(N < \infty) = 1$ and thus $\lim_n P(N > n) = 0$.

Now, let $Z_n = X_n$ if $n < N$ and let $Z_n = Y_n$ if $n \ge N$. It is
easy to see that $Z_n, n \ge 0$ is also a Markov chain with transition
probabilities $P_{ij}$ and has $Z_0 = i$. Now

$$\begin{aligned}
P^{(n)}_{ij} &= P(Z_n = j) \\
&= P(Z_n = j, N \le n) + P(Z_n = j, N > n) \\
&= P(Y_n = j, N \le n) + P(Z_n = j, N > n) \\
&\le P(Y_n = j) + P(N > n) \\
&= \pi_j + P(N > n)
\end{aligned} \tag{5.2}$$

On the other hand

$$\begin{aligned}
\pi_j &= P(Y_n = j) \\
&= P(Y_n = j, N \le n) + P(Y_n = j, N > n) \\
&= P(Z_n = j, N \le n) + P(Y_n = j, N > n) \\
&\le P(Z_n = j) + P(N > n) \\
&= P^{(n)}_{ij} + P(N > n)
\end{aligned} \tag{5.3}$$

Hence, from (5.2) and (5.3) we see that

$$\lim_{n \to \infty} P^{(n)}_{ij} = \pi_j. \qquad ∎$$


Remark 5.17 It follows from Theorem 5.13 that if we have an
irreducible Markov chain, and we can find a solution of the
stationarity equations

$$\pi_j = \sum_i \pi_i P_{ij}, \quad j \ge 0$$

$$\sum_i \pi_i = 1$$

then the Markov chain is positive recurrent, and the $\pi_i$ are the
unique stationary probabilities. If, in addition, the chain is
aperiodic then the $\pi_i$ are also limiting probabilities.



Remark 5.18 Because $\mu_i$ is the mean number of transitions between
successive visits to state $i$, it is intuitive (and will be formally
proven in the chapter on renewal theory) that the long run proportion
of time that the chain spends in state $i$ is equal to $1/\mu_i$. Hence,
the stationary probability $\pi_i$ is equal to the long run proportion
of time that the chain spends in state $i$.
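
The identity $\pi_j = 1/\mu_j$ can be checked numerically on a small
chain. The following is a minimal sketch, not part of the original
text: the 3-state matrix `P`, the helper names `stationary` and
`mean_return_time`, and the iteration counts are all illustrative
assumptions. It approximates $\pi$ by repeatedly applying
$\pi \leftarrow \pi P$ and approximates $\mu_j$ by iterating the
first-step equations for the hitting time of $j$.

```python
def stationary(P, iters=2000):
    """Approximate the stationary vector by iterating pi <- pi P.

    Assumes the chain is irreducible and aperiodic (illustrative helper).
    """
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


def mean_return_time(P, j, iters=2000):
    """Approximate mu_j = E[T_j | X_0 = j] by first-step analysis.

    m[i] approximates the mean number of steps to reach j from i != j,
    obtained by iterating m_i = 1 + sum_{k != j} P_ik m_k from m = 0.
    """
    n = len(P)
    m = [0.0] * n
    for _ in range(iters):
        m = [0.0 if i == j else
             1.0 + sum(P[i][k] * m[k] for k in range(n) if k != j)
             for i in range(n)]
    return 1.0 + sum(P[j][k] * m[k] for k in range(n) if k != j)


# Hypothetical transition matrix used only for illustration.
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.1, 0.4, 0.5]]

pi = stationary(P)
mu = [mean_return_time(P, j) for j in range(len(P))]
for j in range(len(P)):
    print(j, pi[j], 1.0 / mu[j])
```

For this particular matrix the stationarity equations can also be
solved by hand, giving $\pi = (12/49, 23/49, 14/49)$, which the two
printed columns should both reproduce up to rounding.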


Definition 5.19 A positive recurrent, aperiodic, irreducible Markov 
chain is called an ergodic Markov chain. 


Definition 5.20 A positive recurrent irreducible Markov chain whose 
initial state is distributed according to its stationary probabilities is 
called a stationary Markov chain. 


5.6 Time Reversibility 
A stationary Markov chain $X_n$ is called time reversible if

$$P(X_n = j \mid X_{n+1} = i) = P(X_{n+1} = j \mid X_n = i)
\quad \text{for all } i, j.$$

By the Markov property we know that the processes
$X_{n+1}, X_{n+2}, \ldots$ and $X_{n-1}, X_{n-2}, \ldots$ are
conditionally independent given $X_n$, and so it follows that the
reversed process $X_{n-1}, X_{n-2}, \ldots$ will also be a Markov chain
having transition probabilities

$$P(X_n = j \mid X_{n+1} = i)
= \frac{P(X_n = j, X_{n+1} = i)}{P(X_{n+1} = i)}
= \frac{\pi_j P_{ji}}{\pi_i}$$

where $\pi_i$ and $P_{ij}$ respectively denote the stationary
probabilities and the transition probabilities for the Markov chain
$X_n$. Thus an equivalent definition for $X_n$ being time reversible is
if

$$\pi_i P_{ij} = \pi_j P_{ji} \quad \text{for all } i, j.$$

Intuitively, a Markov chain is time reversible if it looks the same
running backwards as it does running forwards. It also means that
the rate of transitions from $i$ to $j$, namely $\pi_i P_{ij}$, is the
same as the rate of transitions from $j$ to $i$, namely $\pi_j P_{ji}$.
This happens if there are no "loops" for which a Markov chain is more
likely in the long run to go in one direction compared with the other
direction. We illustrate with an example.

Example 5.21 Random walk on the circle. Consider a particle which
moves around $n$ positions on a circle numbered $1, 2, \ldots, n$
according to transition probabilities $P_{i,i+1} = p = 1 - P_{i+1,i}$
for $1 \le i < n$ and $P_{n,1} = p = 1 - P_{1,n}$. Let $X_k$ be the
position of the particle at time $k$. Regardless of $p$ it is easy to
see that the stationary probabilities are $\pi_i = 1/n$. Now, for
$1 \le i < n$ we have $\pi_i P_{i,i+1} = p/n$ and
$\pi_{i+1} P_{i+1,i} = (1-p)/n$ (and also $\pi_n P_{n,1} = p/n$ and
$\pi_1 P_{1,n} = (1-p)/n$). If $p = 1/2$ these will all be equal and
the chain will be time reversible. On the other hand, if $p \ne 1/2$
these will not be equal and the chain will not be time reversible.
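
The two claims in Example 5.21 can be verified mechanically. The
sketch below is an illustration only: the helper names
`circle_chain`, `is_stationary`, and `is_reversible`, the choice of
$n = 5$ positions (indexed from 0 rather than 1), and the tolerance
are assumptions made here, not part of the text.

```python
def circle_chain(n, p):
    """Transition matrix of the circle walk on positions 0..n-1 (n >= 3)."""
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        P[i][(i + 1) % n] = p          # step forward around the circle
        P[i][(i - 1) % n] = 1.0 - p    # step backward around the circle
    return P


def is_stationary(pi, P, tol=1e-12):
    """Check pi_j = sum_i pi_i P_ij for every state j."""
    n = len(P)
    return all(abs(pi[j] - sum(pi[i] * P[i][j] for i in range(n))) < tol
               for j in range(n))


def is_reversible(pi, P, tol=1e-12):
    """Check the detailed balance equations pi_i P_ij = pi_j P_ji."""
    n = len(P)
    return all(abs(pi[i] * P[i][j] - pi[j] * P[j][i]) < tol
               for i in range(n) for j in range(n))


n = 5
uniform = [1.0 / n] * n
for p in (0.5, 0.7):
    P = circle_chain(n, p)
    print(p, is_stationary(uniform, P), is_reversible(uniform, P))
```

The uniform vector passes the stationarity check for both values of
$p$, while detailed balance holds only for $p = 1/2$.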


It can be much easier to verify the stationary probabilities for
a time reversible Markov chain than for a Markov chain which is
not time reversible. Verifying the stationary probabilities $\pi_i$ for
a Markov chain involves checking $\sum_i \pi_i = 1$ and, for all $j$,

$$\pi_j = \sum_i \pi_i P_{ij}.$$

For a time reversible Markov chain, it only requires checking that

$$\pi_i P_{ij} = \pi_j P_{ji}$$

for all $i, j$. Because if the preceding holds, then summing both sides
over $j$ yields

$$\pi_i \sum_j P_{ij} = \sum_j \pi_j P_{ji}$$

or

$$\pi_i = \sum_j \pi_j P_{ji}.$$

This can be convenient in some cases, and we illustrate one next.



Example 5.22 Random walk on a graph. Consider a particle moving on a
graph which consists of nodes and edges, and let $d_i$ be the number of
edges emanating from node $i$. If $X_n$ is the location of the particle
at time $n$, let $P(X_{n+1} = j \mid X_n = i) = 1/d_i$ if there is an
edge connecting node $i$ and node $j$. This means that when at a given
node, the random walker's next step is equally likely to be to any of
the nodes which are connected by an edge. If $D$ is the total number
of edges which appear in the graph, we will show that the stationary
probabilities are given by $\pi_i = \frac{d_i}{2D}$.


Solution. Checking that $\pi_i P_{ij} = \pi_j P_{ji}$ holds for the
claimed solution, we see that this requires that

$$\frac{d_i}{2D} \cdot \frac{1}{d_i} = \frac{d_j}{2D} \cdot \frac{1}{d_j}$$

It thus follows, since also $\sum_i \pi_i = \sum_i \frac{d_i}{2D} = 1$,
that the Markov chain is time reversible with the given stationary
probabilities. ∎
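
As a concrete instance of Example 5.22, the sketch below builds a
small graph and checks both detailed balance and stationarity for
$\pi_i = d_i/(2D)$. The edge list is hypothetical and chosen only for
illustration, and the code assumes a simple graph (no self-loops and
no repeated edges).

```python
# Hypothetical undirected graph, given as a list of edges.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]

nodes = sorted({v for e in edges for v in e})
neighbors = {v: set() for v in nodes}
for a, b in edges:
    neighbors[a].add(b)
    neighbors[b].add(a)

D = len(edges)                                   # total number of edges
deg = {v: len(neighbors[v]) for v in nodes}      # d_i
pi = {v: deg[v] / (2 * D) for v in nodes}        # claimed stationary vector


def P(i, j):
    """Transition probability of the random walk: 1/d_i along each edge."""
    return 1.0 / deg[i] if j in neighbors[i] else 0.0


balanced = all(abs(pi[i] * P(i, j) - pi[j] * P(j, i)) < 1e-12
               for i in nodes for j in nodes)
stationary = all(abs(pi[j] - sum(pi[i] * P(i, j) for i in nodes)) < 1e-12
                 for j in nodes)
print(pi, balanced, stationary)
```

Each edge contributes exactly $1/(2D)$ to both sides of the detailed
balance equation, which is why the check succeeds for any simple
connected graph and not just this one.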


5.7 A Mean Passage Time Bound 


Consider now a Markov chain whose state space is the set of
nonnegative integers, and which is such that

$$P_{ij} = 0, \quad 0 \le i < j \tag{5.4}$$

That is, the state of the Markov chain can never strictly increase.
Suppose we are interested in bounding the expected number of
transitions it takes such a chain to go from state $n$ to state 0. To
obtain such a bound, let $D_i$ be the amount by which the state
decreases when a transition from state $i$ occurs, so that

$$P(D_i = k) = P_{i,i-k}, \quad 0 \le k \le i$$

The following proposition yields the bound.


Proposition 5.23 Let $N_n$ denote the number of transitions it takes
a Markov chain satisfying (5.4) to go from state $n$ to state 0. If for
some nondecreasing function $d_i$, $i > 0$, we have that
$E[D_i] \ge d_i$, then

$$E[N_n] \le \sum_{i=1}^{n} 1/d_i$$


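Proof. The proof is by induction on $n$. It is true when $n = 1$,
because $N_1$ is geometric with mean

$$E[N_1] = \frac{1}{P_{1,0}} = \frac{1}{E[D_1]} \le \frac{1}{d_1}$$

So, assume that $E[N_k] \le \sum_{i=1}^{k} 1/d_i$ for all $k < n$. To
bound $E[N_n]$, we condition on the transition out of state $n$ and use
the induction hypothesis in the first inequality below to get

$$\begin{aligned}
E[N_n] &= \sum_{j=0}^{n} E[N_n \mid D_n = j] P(D_n = j) \\
&= 1 + \sum_{j=0}^{n} E[N_{n-j}] P(D_n = j) \\
&= 1 + P_{n,n} E[N_n] + \sum_{j=1}^{n} E[N_{n-j}] P(D_n = j) \\
&\le 1 + P_{n,n} E[N_n] + \sum_{j=1}^{n} P(D_n = j) \sum_{i=1}^{n-j} 1/d_i \\
&= 1 + P_{n,n} E[N_n] + \sum_{j=1}^{n} P(D_n = j)
   \Big[\sum_{i=1}^{n} 1/d_i - \sum_{i=n-j+1}^{n} 1/d_i\Big] \\
&\le 1 + P_{n,n} E[N_n] + \sum_{j=1}^{n} P(D_n = j)
   \Big[\sum_{i=1}^{n} 1/d_i - j/d_n\Big]
\end{aligned}$$

where the last line follows because $d_i$ is nondecreasing. Continuing
from the previous line we get

$$\begin{aligned}
&= 1 + P_{n,n} E[N_n] + (1 - P_{n,n}) \sum_{i=1}^{n} 1/d_i
   - \frac{1}{d_n} \sum_{j=1}^{n} j P(D_n = j) \\
&= 1 + P_{n,n} E[N_n] + (1 - P_{n,n}) \sum_{i=1}^{n} 1/d_i
   - \frac{E[D_n]}{d_n} \\
&\le P_{n,n} E[N_n] + (1 - P_{n,n}) \sum_{i=1}^{n} 1/d_i
\end{aligned}$$

where the final inequality uses $E[D_n] \ge d_n$. Rearranging gives
$E[N_n] \le \sum_{i=1}^{n} 1/d_i$, which completes the proof. ∎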
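
To see the bound of Proposition 5.23 at work, the following sketch
compares it with the exact mean passage time for a small chain. The
chain is hypothetical and chosen only for illustration: from state
$i > 0$ it moves to $\lfloor i/2 \rfloor$ or to $i - 1$, each with
probability $1/2$, so that $E[D_i] = (\lceil i/2 \rceil + 1)/2$ is
nondecreasing and can itself serve as $d_i$. The function names are
assumptions made here. Exact rational arithmetic is used so that the
comparison is not affected by rounding.

```python
from fractions import Fraction


def exact_mean_passage(n_max):
    """E[N_n] for n = 0..n_max, from E[N_i] = 1 + (E[N_{i//2}] + E[N_{i-1}]) / 2."""
    e = [Fraction(0)] * (n_max + 1)
    for i in range(1, n_max + 1):
        e[i] = 1 + (e[i // 2] + e[i - 1]) / 2
    return e


def expected_decrease(i):
    """E[D_i] for the illustrative chain: average of i - i//2 and 1."""
    return Fraction((i - i // 2) + 1, 2)


n_max = 40
exact = exact_mean_passage(n_max)

# bound[n] = sum_{i=1}^{n} 1 / d_i with d_i = E[D_i].
bound = [Fraction(0)] * (n_max + 1)
for n in range(1, n_max + 1):
    bound[n] = bound[n - 1] + 1 / expected_decrease(n)

for n in (1, 2, 3, 10, 40):
    print(n, float(exact[n]), float(bound[n]))
```

For $n = 1$ and $n = 2$ the bound is attained exactly; for larger $n$
it is strictly larger than the exact mean.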
Example 5.24 At each stage, each of a set of balls is independently
put in one of $n$ urns, with each ball being put in urn $i$ with
probability $p_i$, $\sum_{i=1}^{n} p_i = 1$. After this is done, all of
the balls in the same urn are coalesced into a single new ball, with
this process continually repeated until a single ball remains.
Starting with $N$ balls, we would like to bound the mean number of
stages needed until a single ball remains.

We can model the preceding as a Markov chain $\{X_k, k \ge 0\}$,
whose state is the number of balls that remain in the beginning
of a stage. Because the number of balls that remain after a stage
beginning with $i$ balls is equal to the number of nonempty urns when
these $i$ balls are distributed, it follows that

$$\begin{aligned}
E[X_{k+1} \mid X_k = i]
&= E\Big[\sum_{j=1}^{n} I\{\text{urn } j \text{ is nonempty}\} \,\Big|\, X_k = i\Big] \\
&= \sum_{j=1}^{n} P(\text{urn } j \text{ is nonempty} \mid X_k = i) \\
&= \sum_{j=1}^{n} \big[1 - (1 - p_j)^i\big]
\end{aligned}$$

Hence, $E[D_i]$, the expected decrease from state $i$, is

$$E[D_i] = i - n + \sum_{j=1}^{n} (1 - p_j)^i$$

As

$$E[D_{i+1}] - E[D_i] = 1 - \sum_{j=1}^{n} p_j (1 - p_j)^i > 0$$

it follows from Proposition 5.23 that the mean number of transitions
to go from state $N$ to state 1 satisfies

$$E[N_N] \le \sum_{i=2}^{N} \frac{1}{i - n + \sum_{j=1}^{n} (1 - p_j)^i}$$

5.8 Exercises


1. Let $f_{ij}$ denote the probability that the Markov chain ever makes
    a transition into state $j$ given that it starts in state $i$. Show
    that if $i$ is recurrent and communicates with $j$ then $f_{ij} = 1$.

2. Show that a recurrent class of states of a Markov chain is a
    closed class, in the sense that if $i$ is recurrent and $i$ does not
    communicate with $j$ then $P_{ij} = 0$.

3. The one-dimensional simple random walk is the Markov chain
    $X_n, n \ge 0$, whose states are all the integers, and which has the
    transition probabilities

    $$P_{i,i+1} = 1 - P_{i,i-1} = p$$

    Show that this chain is recurrent when $p = 1/2$, and transient
    for all $p \ne 1/2$. When $p = 1/2$, the chain is called the
    1-dimensional simple symmetric random walk.

    Hint: Make use of Stirling's approximation, which states that

    $$n! \sim n^{n+1/2} e^{-n} \sqrt{2\pi}$$

    where we say that $a_n \sim b_n$ if $\lim_{n \to \infty} a_n/b_n = 1$.
    You can also use the fact that if $a_n > 0$, $b_n > 0$ for all $n$,
    then $a_n \sim b_n$ implies that $\sum_n a_n < \infty$ if and only if
    $\sum_n b_n < \infty$.

4. The 2-dimensional simple symmetric random walk moves on a
    two dimensional grid according to the transition probabilities

    $$P_{(i,j),(i,j+1)} = P_{(i,j),(i+1,j)} = P_{(i,j),(i-1,j)} = P_{(i,j),(i,j-1)} = 1/4$$

    Show that this Markov chain is recurrent.

5. Define the three-dimensional simple symmetric random walk,
    and then show that it is transient.


6. Cover times. Given a finite state Markov chain $X_n$, let
    $T_i = \min\{n \ge 0 : X_n = i\}$ and $C = \max_i T_i$.

    (a) Show that for any subset of states $A$

    $$\min_i E[C \mid X_0 = i] \ge \sum_{m=1}^{|A|} \frac{1}{m} \min_{i \in A, j \in A} E_i[T_j],$$

    where $|A|$ denotes the number of elements in $A$.

    (b) Obtain a lower bound for the mean number of flips required
    until all $2^k$ patterns of length $k$ have appeared when a fair coin
    is repeatedly flipped.


7. Consider a Markov chain whose state space is the set of
    nonnegative integers. Suppose its transition probabilities are given
    by

    $$P_{0,i} = p_i, \ i \ge 0, \qquad P_{i,i-1} = 1, \ i > 0$$

    where $\sum_i i p_i < \infty$. Find the limiting probabilities for
    this Markov chain.

8. Consider a Markov chain with states $0, 1, \ldots, N$ and transition
    probabilities

    $$P_{0,N} = 1, \qquad P_{ij} = 1/i, \ i > 0, \ j < i$$

    That is, from state 0 the chain always goes to state $N$, and from
    state $i > 0$ it is equally likely to go to any lower numbered state.
    Find the limiting probabilities of this chain.

9. Consider a Markov chain with states $0, 1, \ldots, N$ and transition
    probabilities

    $$P_{i,i+1} = p = 1 - P_{i,i-1}, \quad i = 1, \ldots, N-1$$

    $$P_{0,0} = P_{N,N} = 1$$

    Suppose that $X_0 = i$, where $0 < i < N$. Argue that, with
    probability 1, the Markov chain eventually enters either state
    0 or $N$. Derive the probability it enters state $N$ before state 0.
    This is called the gambler's ruin probability.


10. If $X_n$ is a stationary ergodic Markov chain, show that
    $X_1, X_2, \ldots$ is an ergodic sequence.

11. Suppose $X_1, X_2, \ldots$ are iid integer valued random variables
    with $M_n = \max_{i \le n} X_i$. Is $M_n$ necessarily a Markov chain?
    If yes, give its transition probabilities; if no, construct a
    counterexample.

12. Suppose $X_n$ is a finite state stationary Markov chain, and let
    $T = \min\{n > 0 : X_n = X_0\}$. Compute $E[T]$.

13. Hastings-Metropolis algorithm. Given an irreducible Markov
    chain with transition probabilities $P_{ij}$ and any positive
    probability vector $\{\pi_i\}$ for these states, show that the Markov
    chain with transition probabilities
    $Q_{ij} = \min(P_{ij}, \pi_j P_{ji}/\pi_i)$ if $i \ne j$ and
    $Q_{ii} = 1 - \sum_{j \ne i} Q_{ij}$ is time reversible and has
    stationary distribution $\{\pi_i\}$.


14. Consider a time reversible Markov chain with transition
    probabilities $P_{ij}$ and stationary probabilities $\pi_i$. If $A$ is
    a set of states of this Markov chain, then we define the
    $A$-truncated chain as being a Markov chain whose set of states is
    $A$ and whose transition probabilities $P^A_{ij}$, $i, j \in A$, are
    given by

    $$P^A_{ij} = \begin{cases}
    P_{ij} & \text{if } j \ne i \\
    P_{ii} + \sum_{k \notin A} P_{ik} & \text{if } j = i
    \end{cases}$$

    If this truncated chain is irreducible, show that it is time
    reversible, with stationary probabilities

    $$\pi^A_i = \pi_i \Big/ \sum_{j \in A} \pi_j, \quad i \in A$$


15. A collection of $M$ balls are distributed among $m$ urns. At each
    stage one of the balls is randomly selected, taken from whatever
    urn it is in and then randomly placed in one of the other $m - 1$
    urns. Consider the Markov chain whose state at any time is
    the vector $(n_1, n_2, \ldots, n_m)$ where $n_i$ is the number of
    balls in urn $i$. Show that this Markov chain is time reversible and
    find its stationary probabilities.


16. Let $Q$ be an irreducible symmetric transition probability matrix
    on the states $1, \ldots, n$. That is,

    $$Q_{ij} = Q_{ji}, \quad i, j = 1, \ldots, n$$

    Let $b_i$, $i = 1, \ldots, n$ be specified positive numbers, and
    consider a Markov chain with transition probabilities

    $$P_{ij} = Q_{ij} \frac{b_j}{b_i + b_j}, \quad j \ne i$$

    $$P_{ii} = 1 - \sum_{j \ne i} P_{ij}$$

    Show that this Markov chain is time reversible with stationary
    probabilities

    $$\pi_i = \frac{b_i}{\sum_{j=1}^{n} b_j}, \quad i = 1, \ldots, n$$


17. Consider a Markov chain whose state space is the set of positive
    integers, and whose transition probabilities are

    $$P_{1,1} = 1, \qquad P_{ij} = \frac{1}{i-1}, \ 1 \le j < i, \ i > 1$$

    Show that the bound on the mean number of transitions to go
    from state $n$ to state 1 given by Proposition 5.23 is
    approximately twice the actual mean number.


Chapter 6 


Renewal Theory 


6.1 Introduction 


A counting process whose sequence of interevent times are independent
and identically distributed is called a renewal process. More
formally, let $X_1, X_2, \ldots$ be a sequence of independent and
identically distributed nonnegative random variables having
distribution function $F$. Assume that $F(0) \ne 1$, so that the $X_i$
are not identically 0, and set

$$S_0 = 0, \qquad S_n = \sum_{i=1}^{n} X_i, \quad n \ge 1$$

With

$$N(t) = \sup(n : S_n \le t)$$

the process $\{N(t), t \ge 0\}$ is called a renewal process.

If we suppose that events are occurring in time and we interpret
$X_n$ as the time between the $(n-1)$st and the $n$th event, then $S_n$
is the time of the $n$th event, and $N(t)$ represents the number of
events that occur before or at time $t$. An event is also called a
renewal, because if we consider the time of occurrence of an event as
the new origin, then because the $X_i$ are independent and identically
distributed, the process of future events is also a renewal process
with interarrival distribution $F$. Thus the process probabilistically
restarts, or renews, whenever an event occurs.

Let $\mu = E[X_i]$. Because $P(X_i \ge 0) = 1$ and $P(X_i = 0) < 1$ it
follows that $\mu > 0$. Consequently, by the strong law of large
numbers,

$$\lim_{n \to \infty} S_n/n = \mu > 0$$

implying that

$$\lim_{n \to \infty} S_n = \infty$$

Thus, with probability 1, $S_n < t$ for only a finite number of $n$,
showing that

$$P(N(t) < \infty) = 1$$

and enabling us to write

$$N(t) = \max(n : S_n \le t)$$

The function

$$m(t) = E[N(t)]$$

is called the renewal function. We now argue that it is finite for all
$t$.

Proposition 6.1

$$m(t) < \infty$$

Proof. Because $P(X_i \le 0) < 1$, it follows from the continuity
property of probabilities that there is a value $\beta > 0$ such that
$P(X_i \ge \beta) > 0$. Let

$$\bar{X}_i = \beta\, I\{X_i \ge \beta\}$$

and define the renewal process

$$\bar{N}(t) = \max(n : \bar{X}_1 + \cdots + \bar{X}_n \le t)$$

Because renewals of this process can only occur at integral multiples
of $\beta$, and because the number of them that occur at the time
$n\beta$ is a geometric random variable with parameter
$P(X_i \ge \beta)$, it follows that

$$E[\bar{N}(t)] \le \frac{t/\beta + 1}{P(X_i \ge \beta)} < \infty$$

Because $\bar{X}_i \le X_i$, $i \ge 1$, implies that
$N(t) \le \bar{N}(t)$, the result is proven. ∎

6.2 Some Limit Theorems of Renewal Theory


In this section we prove the strong law and the central limit theorem
for renewal processes, as well as the elementary renewal theorem.
We start with the strong law for renewal processes, which says that
$N(t)/t$ converges almost surely to the inverse of the mean interevent
time.


Proposition 6.2 The Strong Law for Renewal Processes
With probability 1,

$$\lim_{t \to \infty} \frac{N(t)}{t} = \frac{1}{\mu}$$

Proof. Because $S_n$ is the time of the $n$th event, and $N(t)$ is the
number of events by time $t$, it follows that $S_{N(t)}$ and
$S_{N(t)+1}$ represent, respectively, the time of the last event prior
to or at time $t$ and the time of the first event after $t$.
Consequently,

$$S_{N(t)} \le t < S_{N(t)+1}$$

implying that

$$\frac{S_{N(t)}}{N(t)} \le \frac{t}{N(t)} < \frac{S_{N(t)+1}}{N(t)} \tag{6.1}$$

Because $N(t) \to \infty$ almost surely as $t \to \infty$, it follows by
the strong law of large numbers that, almost surely,

$$\frac{S_{N(t)}}{N(t)} = \frac{X_1 + \cdots + X_{N(t)}}{N(t)} \to \mu
\quad \text{as } t \to \infty$$

Similarly, almost surely,

$$\frac{S_{N(t)+1}}{N(t)}
= \frac{X_1 + \cdots + X_{N(t)+1}}{N(t)+1} \cdot \frac{N(t)+1}{N(t)}
\to \mu \quad \text{as } t \to \infty$$

and the result follows. ∎
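
The strong law is easy to observe in simulation. The sketch below is
an illustration under stated assumptions: interevent times are taken
to be exponential with mean $\mu = 0.5$, the horizon is $t = 10{,}000$,
and the helper name `count_renewals` and the seed are arbitrary
choices made here.

```python
import random


def count_renewals(t, draw_interarrival):
    """Return N(t) = max(n : S_n <= t) along one simulated path."""
    n, s = 0, 0.0
    while True:
        s += draw_interarrival()   # s is now S_{n+1}
        if s > t:
            return n
        n += 1


rng = random.Random(0)             # fixed seed so the run is repeatable
mu = 0.5                           # assumed mean interevent time
t = 10_000.0

n_t = count_renewals(t, lambda: rng.expovariate(1.0 / mu))
print(n_t / t, 1.0 / mu)           # N(t)/t should be close to 1/mu
```

Any other nonnegative interevent distribution with the same mean could
be passed in place of the exponential draw; the limit $1/\mu$ depends
on the distribution only through its mean.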


Example 6.3 Suppose that a coin selected from a bin will on each
flip come up heads with a fixed but unknown probability that is
uniformly distributed on (0,1). At any time the coin currently in use
can be discarded and a new coin chosen. The heads probability of this
new coin, independent of what has previously transpired, will also
have a uniform (0,1) distribution. If one's objective is to maximize
the long run proportion of flips that land heads, what is a good
strategy?


Solution. Consider the strategy of discarding the current coin
whenever it lands on tails. Under this strategy every time a tail
occurs we have a renewal. Thus, by the strong law for renewal
processes, the long run proportion of flips that land tails is the
inverse of $\mu$, the mean number of flips until a selected coin comes
up tails. Because a coin with heads probability $p$ requires on
average $1/(1-p)$ flips to land tails,

$$\mu = \int_0^1 \frac{1}{1-p}\, dp = \infty$$

it follows that, under this strategy, the long run proportion of coin
flips that come up heads is 1. ∎


The elementary renewal theorem says that $E[N(t)]/t$ also converges to
$1/\mu$. Before proving it, we will prove a lemma.


Lemma 6.4 Wald's equation. Suppose that $X_n, n \ge 1$ are iid with
finite mean $E[X]$, and that $N$ is a stopping time for this sequence,
in the sense that the event $\{N > n-1\}$ is independent of
$X_n, X_{n+1}, \ldots$, for all $n$. If $E[N] < \infty$, then

$$E\Big[\sum_{n=1}^{N} X_n\Big] = E[N] E[X]$$


Proof. To begin, let us prove the lemma when the $X_i$ are replaced
by their absolute values. In this case,

$$\begin{aligned}
E\Big[\sum_{n=1}^{N} |X_n|\Big]
&= E\Big[\sum_{n=1}^{\infty} |X_n| I\{N \ge n\}\Big] \\
&= E\Big[\lim_{m \to \infty} \sum_{n=1}^{m} |X_n| I\{N \ge n\}\Big] \\
&= \lim_{m \to \infty} E\Big[\sum_{n=1}^{m} |X_n| I\{N \ge n\}\Big]
\end{aligned}$$

where the monotone convergence theorem (1.40) was used to justify
the interchange of the limit and expectation operations in the last
equality. Continuing we then get

$$\begin{aligned}
E\Big[\sum_{n=1}^{N} |X_n|\Big]
&= \lim_{m \to \infty} \sum_{n=1}^{m} E\big[|X_n| I\{N > n-1\}\big] \\
&= \sum_{n=1}^{\infty} E[|X_n|]\, E\big[I\{N > n-1\}\big] \\
&= E[|X|] \sum_{n=1}^{\infty} P(N > n-1) \\
&= E[|X|] E[N] \\
&< \infty.
\end{aligned}$$

But now we can repeat exactly the same sequence of steps, but with
$X_i$ replacing $|X_i|$, and with the justification of the interchange
of the expectation and limit operations in the third equality now
provided by the dominated convergence theorem (1.35) upon using the
bound $\big|\sum_{n=1}^{m} X_n I\{N \ge n\}\big| \le \sum_{i=1}^{N} |X_i|$. ∎
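
A quick Monte Carlo check of Wald's equation: roll a fair die until
the first 6 appears, so that $N$ is the number of rolls and
$X_1 + \cdots + X_N$ is the total of the faces shown. Here
$E[N] = 6$ and $E[X] = 3.5$, so Wald's equation predicts a mean total
of 21. This die example, the helper name `roll_until_six`, the seed,
and the number of trials are illustrative assumptions, not taken from
the text.

```python
import random


def roll_until_six(rng):
    """Roll a fair die until the first 6; return (N, X_1 + ... + X_N)."""
    n, total = 0, 0
    while True:
        x = rng.randint(1, 6)
        n += 1
        total += x
        if x == 6:          # {N = n} is decided by X_1, ..., X_n only
            return n, total


rng = random.Random(1)      # fixed seed so the run is repeatable
trials = 100_000
sum_n = 0
sum_s = 0
for _ in range(trials):
    n, s = roll_until_six(rng)
    sum_n += n
    sum_s += s

mean_n = sum_n / trials
mean_s = sum_s / trials
print(mean_n, mean_s, mean_n * 3.5)   # last two values should be close
```

Note that $N$ is far from independent of the $X_n$ (the last roll is
always a 6), yet the product formula still holds because $N$ is a
stopping time.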


Proposition 6.5 The Elementary Renewal Theorem

$$\lim_{t \to \infty} \frac{m(t)}{t} = \frac{1}{\mu}
\quad \Big(\text{where } \frac{1}{\infty} = 0\Big)$$

Proof. Suppose first that $\mu < \infty$. Because

$$N(t) + 1 = \min(n : S_n > t)$$

it follows that $N(t) + 1$ is a stopping time for the sequence of
interevent times $X_1, X_2, \ldots$. Consequently, by Wald's equation,
we see that

$$E[S_{N(t)+1}] = \mu[m(t) + 1]$$

Because $S_{N(t)+1} > t$ the preceding implies that

$$\liminf_{t \to \infty} \frac{m(t)}{t} \ge \frac{1}{\mu}$$

We will complete the proof by showing that
$\limsup_{t \to \infty} \frac{m(t)}{t} \le \frac{1}{\mu}$.
Towards this end, fix a positive constant $M$ and define a related
renewal process with interevent times $\bar{X}_n$, $n \ge 1$, given by

$$\bar{X}_n = \min(X_n, M)$$

Let

$$\bar{S}_n = \sum_{i=1}^{n} \bar{X}_i, \qquad
\bar{N}(t) = \max(n : \bar{S}_n \le t).$$

Because an interevent time of this related renewal process is at most
$M$, it follows that

$$\bar{S}_{\bar{N}(t)+1} \le t + M$$

Taking expectations and using Wald's equation yields

$$\mu_M [\bar{m}(t) + 1] \le t + M$$

where $\mu_M = E[\bar{X}_n]$ and $\bar{m}(t) = E[\bar{N}(t)]$. The
preceding equation implies that

$$\limsup_{t \to \infty} \frac{\bar{m}(t)}{t} \le \frac{1}{\mu_M}$$

However, $\bar{X}_n \le X_n$, $n \ge 1$, implies that
$\bar{N}(t) \ge N(t)$, and thus that $\bar{m}(t) \ge m(t)$. Thus,

$$\limsup_{t \to \infty} \frac{m(t)}{t} \le \frac{1}{\mu_M} \tag{6.2}$$

Now,

$$\min(X_1, M) \uparrow X_1 \quad \text{as } M \uparrow \infty$$

and so, by the dominated convergence theorem, it follows that

$$\mu_M \to \mu \quad \text{as } M \to \infty$$

Thus, letting $M \to \infty$ in (6.2) yields

$$\limsup_{t \to \infty} \frac{m(t)}{t} \le \frac{1}{\mu}$$

Thus, the result is established when $\mu < \infty$. When
$\mu = \infty$, again consider the related renewal process with
interarrivals $\min(X_n, M)$. Using the monotone convergence theorem we
can conclude that $\mu_M = E[\min(X_1, M)] \to \infty$ as
$M \to \infty$. Consequently, (6.2) implies that

$$\limsup_{t \to \infty} \frac{m(t)}{t} = 0$$

and the proof is complete. ∎


If the interarrival times $X_i$, $i \ge 1$, of the counting process
$N(t)$, $t \ge 0$, are independent, but with $X_1$ having distribution
$G$, and the other $X_i$ having distribution $F$, the counting process
is said to be a delayed renewal process. We leave it as an exercise to
show that the analogs of the strong law and the elementary renewal
theorem remain valid.


Remark 6.6 Consider an irreducible recurrent Markov chain. For
any state $j$, we can consider transitions into state $j$ as
constituting renewals. If $X_0 = j$, then $N_j(n)$, $n \ge 0$, would be
a renewal process, where $N_j(n)$ is the number of transitions into
state $j$ by time $n$; if $X_0 \ne j$, then $N_j(n)$, $n \ge 0$, would
be a delayed renewal process. The strong law for renewal processes
then shows that, with probability 1, the long run proportion of
transitions that are into state $j$ is $1/\mu_j$, where $\mu_j$ is the
mean number of transitions between successive visits to state $j$.
Thus, for positive recurrent irreducible chains the stationary
probabilities will equal these long run proportions of time that the
chain spends in each state.


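Proposition 6.7 The Central Limit Theorem for Renewal Processes
If $\mu$ and $\sigma^2$, assumed finite, are the mean and variance of
an interevent time, then $N(t)$ is asymptotically normal with mean
$t/\mu$ and variance $t\sigma^2/\mu^3$. That is,

$$\lim_{t \to \infty} P\Big(\frac{N(t) - t/\mu}{\sigma\sqrt{t/\mu^3}} < y\Big)
= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{y} e^{-x^2/2}\, dx$$

Proof. Let $r_t = t/\mu + y\sigma\sqrt{t/\mu^3}$. If $r_t$ is an
integer, let $n_t = r_t$; if $r_t$ is not an integer, let
$n_t = [r_t] + 1$, where $[x]$ is the largest integer less than or
equal to $x$. Then

$$\begin{aligned}
P\Big(\frac{N(t) - t/\mu}{\sigma\sqrt{t/\mu^3}} < y\Big)
&= P(N(t) < r_t) \\
&= P(N(t) < n_t) \\
&= P(S_{n_t} > t) \\
&= P\Big(\frac{S_{n_t} - n_t\mu}{\sigma\sqrt{n_t}} >
   \frac{t - n_t\mu}{\sigma\sqrt{n_t}}\Big)
\end{aligned}$$

where the preceding used that the events $\{N(t) < n\}$ and
$\{S_n > t\}$ are equivalent. Now, by the central limit theorem,
$\frac{S_{n_t} - n_t\mu}{\sigma\sqrt{n_t}}$ converges to a standard
normal random variable as $n_t$ approaches $\infty$, or, equivalently,
as $t$ approaches $\infty$. Also,

$$\begin{aligned}
\lim_{t \to \infty} \frac{t - n_t\mu}{\sigma\sqrt{n_t}}
&= \lim_{t \to \infty} \frac{t - r_t\mu}{\sigma\sqrt{r_t}} \\
&= \lim_{t \to \infty}
   \frac{-y\mu\sqrt{t/\mu^3}}{\sqrt{t/\mu + y\sigma\sqrt{t/\mu^3}}} \\
&= -y
\end{aligned}$$

Consequently, with $Z$ being a standard normal random variable,

$$\lim_{t \to \infty} P\Big(\frac{N(t) - t/\mu}{\sigma\sqrt{t/\mu^3}} < y\Big)
= P(Z > -y) = P(Z < y)$$

and the proof is complete. ∎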


6.3 Renewal Reward Processes


Consider a renewal process with interarrival times $X_n$, $n \ge 1$,
and suppose that rewards are earned in such a manner that if $R_n$ is
the reward earned during the $n$th renewal cycle (that is, during the
time from $S_{n-1}$ to $S_n$) then the random vectors $(X_n, R_n)$ are
independent and identically distributed. The idea of this definition
is that the reward earned during a renewal cycle is allowed to depend
on what occurs during that cycle, and thus on its length, but whenever
a renewal occurs the process probabilistically restarts. Let $R(t)$
denote the total reward earned by time $t$.


Theorem 6.8 If $E[R_1]$ and $E[X_1]$ are both finite, then

(a) $\dfrac{R(t)}{t} \to \dfrac{E[R_1]}{E[X_1]}$ almost surely as $t \to \infty$

(b) $\dfrac{E[R(t)]}{t} \to \dfrac{E[R_1]}{E[X_1]}$ as $t \to \infty$


Proof. To begin, let us suppose that the reward received during a
renewal cycle is earned at the end of that cycle. Consequently,

$$R(t) = \sum_{n=1}^{N(t)} R_n$$

and thus

$$\frac{R(t)}{t} = \frac{\sum_{n=1}^{N(t)} R_n}{N(t)} \cdot \frac{N(t)}{t}$$

Because $N(t) \to \infty$ as $t \to \infty$, it follows from the strong
law of large numbers that, almost surely,

$$\frac{\sum_{n=1}^{N(t)} R_n}{N(t)} \to E[R_1]$$

Hence, part (a) follows by the strong law for renewal processes.

To prove (b), fix $0 < M < \infty$, set $\bar{R}_i = \min(R_i, M)$, and
let $\bar{R}(t) = \sum_{i=1}^{N(t)} \bar{R}_i$. Then,

$$\begin{aligned}
E[R(t)] &\ge E[\bar{R}(t)] \\
&= E\Big[\sum_{i=1}^{N(t)+1} \bar{R}_i\Big] - E[\bar{R}_{N(t)+1}] \\
&= [m(t) + 1] E[\bar{R}_1] - E[\bar{R}_{N(t)+1}]
\end{aligned}$$

where the final equality used Wald's equation. Because
$\bar{R}_{N(t)+1} \le M$, the preceding yields

$$\frac{E[R(t)]}{t} \ge \frac{m(t) + 1}{t} E[\bar{R}_1] - \frac{M}{t}$$

Consequently, by the elementary renewal theorem

$$\liminf_{t \to \infty} \frac{E[R(t)]}{t} \ge \frac{E[\bar{R}_1]}{E[X]}$$

By the dominated convergence theorem,
$\lim_{M \to \infty} E[\bar{R}_1] = E[R_1]$, yielding that

$$\liminf_{t \to \infty} \frac{E[R(t)]}{t} \ge \frac{E[R_1]}{E[X]}$$

Letting $R^*(t) = -R(t) = \sum_{i=1}^{N(t)} (-R_i)$ yields, upon
repeating the same argument,

$$\liminf_{t \to \infty} \frac{E[R^*(t)]}{t} \ge \frac{E[-R_1]}{E[X]}$$

or, equivalently,

$$\limsup_{t \to \infty} \frac{E[R(t)]}{t} \le \frac{E[R_1]}{E[X]}$$

Thus,

$$\lim_{t \to \infty} \frac{E\big[\sum_{i=1}^{N(t)} R_i\big]}{t}
= \frac{E[R_1]}{E[X]} \tag{6.3}$$

proving the theorem when the entirety of the reward earned during
a renewal cycle is gained at the end of the cycle. Before proving the
result without this restriction, note that

$$\begin{aligned}
E\Big[\sum_{i=1}^{N(t)} R_i\Big]
&= E\Big[\sum_{i=1}^{N(t)+1} R_i\Big] - E[R_{N(t)+1}] \\
&= E[R_1] E[N(t) + 1] - E[R_{N(t)+1}] \quad \text{by Wald's equation} \\
&= E[R_1][m(t) + 1] - E[R_{N(t)+1}]
\end{aligned}$$

Hence,

$$\frac{E\big[\sum_{i=1}^{N(t)} R_i\big]}{t}
= \frac{m(t) + 1}{t} E[R_1] - \frac{E[R_{N(t)+1}]}{t}$$

and so we can conclude from (6.3) and the elementary renewal theorem
that

$$\frac{E[R_{N(t)+1}]}{t} \to 0 \tag{6.4}$$

Now, let us drop the assumption that the rewards are earned only
at the end of renewal cycles. Suppose first that all partial returns
are nonnegative. Then, with $R(t)$ equal to the total reward earned
by time $t$,

$$\frac{\sum_{i=1}^{N(t)} R_i}{t} \le \frac{R(t)}{t}
\le \frac{\sum_{i=1}^{N(t)} R_i}{t} + \frac{R_{N(t)+1}}{t}$$

Taking expectations, and using (6.3) and (6.4) proves part (b). Part
(a) follows from the inequality

$$\frac{\sum_{i=1}^{N(t)} R_i}{N(t)} \cdot \frac{N(t)}{t}
\le \frac{R(t)}{t}
\le \frac{\sum_{i=1}^{N(t)+1} R_i}{N(t)+1} \cdot \frac{N(t)+1}{t}$$

by noting that for $j = 0, 1$, almost surely,

$$\frac{\sum_{i=1}^{N(t)+j} R_i}{N(t)+j} \cdot \frac{N(t)+j}{t}
\to \frac{E[R_1]}{E[X]}$$

A similar argument holds when all partial returns are nonpositive,
and the general case follows by breaking up the returns into their
positive and negative parts and applying the preceding argument
separately to each. ∎
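
Part (a) of Theorem 6.8 can be illustrated by simulation with a
reward that depends on the cycle length. In the sketch below, which
is an illustration under assumptions chosen here and not part of the
text, the cycle lengths are uniform on $(0, 2)$ and the reward
$R_n = X_n^2$ is paid at the end of cycle $n$, so that
$E[R_1]/E[X_1] = (4/3)/1$. The helper name `reward_rate`, the seed,
and the horizon are arbitrary.

```python
import random


def reward_rate(t, rng):
    """Simulate R(t)/t with X_n ~ Uniform(0, 2) and R_n = X_n**2.

    Rewards are credited only for cycles completed by time t.
    """
    clock, reward = 0.0, 0.0
    while True:
        x = rng.uniform(0.0, 2.0)
        if clock + x > t:          # the current cycle ends after t
            return reward / t
        clock += x
        reward += x * x


rng = random.Random(2)             # fixed seed so the run is repeatable
estimate = reward_rate(50_000.0, rng)
limit = (4.0 / 3.0) / 1.0          # E[R_1] / E[X_1]
print(estimate, limit)
```

The long run rate exceeds $E[X_1]^2 / E[X_1] = 1$ because long cycles
contribute disproportionately large rewards.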


Example 6.9 Generating a Random Variable whose Distribution is
the Stationary Distribution of a Markov Chain. For a finite state
irreducible aperiodic Markov chain $X_n$, $n \ge 0$, having transition
probabilities $\{P_{ij}\}$ and stationary distribution $\{\pi_i\}$,
Theorem 5.13 says that the approximation
$P(X_n = i \mid X_0 = 0) \approx \pi_i$ is good for large $n$. Here we
will show how to find a random time $T \ge 0$ so that we have exactly
$P(X_T = i \mid X_0 = 0) = \pi_i$.

Suppose that for some $p > 0$ we have $P_{i0} > p$ for all $i$. (If
this condition doesn't hold, then we can always find an $m$ such that
the condition holds for the transition probabilities $P^{(m)}_{ij}$,
implying that the condition holds for the Markov chain
$Y_n = X_{nm}$, $n \ge 0$, which also has the stationary distribution
$\{\pi_i\}$.)

To begin, let $J_n \sim$ Bernoulli($p$) be iid and define a Markov
chain $Y_n$ so that

$$P(Y_{n+1} = 0 \mid Y_n = i, J_{n+1} = 1) = 1,$$

$$P(Y_{n+1} = 0 \mid Y_n = i, J_{n+1} = 0) = (P_{i0} - p)/(1 - p),$$

and, for $j \ne 0$,

$$P(Y_{n+1} = j \mid Y_n = i, J_{n+1} = 0) = P_{ij}/(1 - p).$$

Notice that this gives $Y_n = 0$ whenever $J_n = 1$, and in addition
that

$$\begin{aligned}
P(Y_{n+1} = j \mid Y_n = i)
&= P(Y_{n+1} = j \mid Y_n = i, J_{n+1} = 0)(1 - p) \\
&\quad + P(Y_{n+1} = j \mid Y_n = i, J_{n+1} = 1)\, p \\
&= P_{ij}
\end{aligned}$$

Thus, both $X_n$ and $Y_n$ have the same transition probabilities, and
thus the same stationary distribution.

Say that a new cycle begins at time $n$ if $J_n = 1$. Suppose that a
new cycle begins at time 0, so $Y_0 = 0$, and let $N_j$ denote the
number of time periods the chain is in state $j$ during the first
cycle. If we suppose that a reward of 1 is earned each time the chain
is in state $j$, then $\pi_j$ equals the long run average reward per
unit time, and the renewal reward process result yields that

$$\pi_j = \frac{E[N_j]}{E[T]}$$

where $T = \min\{n > 0 : J_n = 1\}$ is the time of the first cycle.
Because $T$ is geometric with parameter $p$, we obtain the identity

$$\pi_j = p E[N_j]$$

Now, let $I_k$ be the indicator variable for the event that a new
cycle begins on the transition following the $k$th visit to state $j$.
Note that

$$\sum_{k=1}^{N_j} I_k = I\{Y_{T-1} = j\}$$

Because $I_1, I_2, \ldots$ are iid and the event $\{N_j = n\}$ is
independent of $I_{n+1}, I_{n+2}, \ldots$ it follows from Wald's
equation that

$$P(Y_{T-1} = j) = E[N_j] E[I_1] = p E[N_j]$$

giving the result that

$$\pi_j = P(Y_{T-1} = j)$$

Remark 6.10 In terms of the original Markov chain $X_n$, set
$X_0 = 0$. Let $U_1, U_2, \ldots$ be a sequence of independent uniform
(0,1) random variables that is independent of the Markov chain. Then
define

$$T = \min(n > 0 : X_n = 0,\ U_n < p/P_{X_{n-1},0})$$

with the result that $P(X_{T-1} = j \mid X_0 = 0) = \pi_j$. In fact, if
we set $X_0 = 0$, let $T_0 = 0$, and define

$$T_i = \min(n > T_{i-1} : X_n = 0,\ U_n < p/P_{X_{n-1},0})$$

then $X_{T_i - 1}$, $i \ge 1$, are iid with

$$P(X_{T_i - 1} = j \mid X_0 = 0) = \pi_j.$$
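
The recipe in Remark 6.10 translates directly into a sampler. The
sketch below is a minimal illustration under assumptions made here:
the 3-state matrix `P`, the value `p = 0.08` (chosen so that every
$P_{i0}$ exceeds it), the helper names, the seed, and the number of
trials are all hypothetical. For simplicity the uniform $U_n$ is drawn
only on steps that land in state 0, which does not change the
distribution of $T$.

```python
import random

# Hypothetical chain; its stationary vector is (12/49, 23/49, 14/49).
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.1, 0.4, 0.5]]
p = 0.08                      # must satisfy P[i][0] > p for every i


def step(i, rng):
    """Draw the next state from row i of P."""
    return rng.choices(range(len(P)), weights=P[i])[0]


def sample_stationary(rng):
    """Return X_{T-1}, with X_0 = 0 and
    T = min(n > 0 : X_n = 0 and U_n < p / P[X_{n-1}][0])."""
    prev = 0
    while True:
        cur = step(prev, rng)
        if cur == 0 and rng.random() < p / P[prev][0]:
            return prev       # the state just before the cycle restarts
        prev = cur


rng = random.Random(3)        # fixed seed so the run is repeatable
trials = 20_000
counts = [0] * len(P)
for _ in range(trials):
    counts[sample_stationary(rng)] += 1
freq = [c / trials for c in counts]
print(freq)
```

Each call returns one exact draw from the stationary distribution, at
the cost of a geometric number of chain steps with mean $1/p$, so a
larger admissible $p$ makes the sampler faster.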


Example 6.11 Suppose that $X_i, i \ge 1$ are iid discrete random variables
with probability mass function $p_i = P(X = i)$. Suppose we
want to find the expected time until the pattern 1,2,1,3,1,2,1 appears.
To do so, suppose that we earn a reward of 1 each time the
pattern occurs. Since a reward of 1 is earned at time $n \ge 7$ with
probability $P(X_n = 1, X_{n-1} = 2, X_{n-2} = 1, \ldots, X_{n-6} = 1) = p_1^4 p_2^2 p_3$,
it follows that the long run expected reward per unit time is $p_1^4 p_2^2 p_3$.
However, suppose that the pattern has just occurred at time 0. Say
that cycle 1 begins at time 1, and that a cycle ends when, ignoring
data from previous cycles, the pattern reappears. Thus, for instance,
if a cycle has just ended then the last data value was 1, the next to
last was 2, then 1, then 3, then 1, then 2, then 1. The next cycle will
begin when, without using any of these values, the pattern reappears.
The total reward earned during a cycle, call it $R$, can be expressed
as
$$R = 1 + A_4 + A_6$$

where $A_4$ is the reward earned when we observe the fourth data
value of the cycle (it will equal 1 if the first 4 values in the cycle are
3,1,2,1), $A_6$ is the reward earned when we observe the sixth data
value of the cycle, and 1 is the reward earned when the cycle ends.
Hence,

$$E[R] = 1 + p_1^2 p_2 p_3 + p_1^3 p_2^2 p_3.$$
If $T$ is the time of a cycle, then by the renewal reward theorem,
$$p_1^4 p_2^2 p_3 = \frac{E[R]}{E[T]}$$
yielding that the expected time until the pattern appears is
$$E[T] = \frac{1}{p_1^4 p_2^2 p_3} + \frac{1}{p_1^2 p_2} + \frac{1}{p_1}.$$
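The three terms in $E[T]$ correspond to the lengths $k = 7, 3, 1$ at which the first $k$ symbols of the pattern coincide with its last $k$ symbols. The sketch below applies that observation to an arbitrary pattern; it is an illustration under that assumption rather than a method stated in the text, and the function name `expected_pattern_time` is hypothetical. Exact rational arithmetic is used so the result can be compared with the formula above.

```python
from fractions import Fraction

def expected_pattern_time(pattern, pmf):
    """Sum 1 / P(first k symbols) over every k where prefix == suffix."""
    n = len(pattern)
    total = Fraction(0)
    for k in range(1, n + 1):
        if pattern[:k] == pattern[n - k:]:      # overlap of length k
            prob = Fraction(1)
            for symbol in pattern[:k]:
                prob *= pmf[symbol]
            total += 1 / prob
    return total

pattern = (1, 2, 1, 3, 1, 2, 1)
uniform = {1: Fraction(1, 3), 2: Fraction(1, 3), 3: Fraction(1, 3)}
print(expected_pattern_time(pattern, uniform))  # 3**7 + 3**3 + 3 = 2217
```

With $p_1 = p_2 = p_3 = 1/3$ the formula of the example gives $3^7 + 3^3 + 3 = 2217$, which is what the function returns.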
Let
$$A(t) = t - S_{N(t)}, \qquad Y(t) = S_{N(t)+1} - t.$$
The random variable $A(t)$, equal to the time at $t$ since the last renewal
prior to (or at) time $t$, is called the age of the renewal process
at $t$. The random variable $Y(t)$, equal to the time from $t$ until the
next renewal, is called the excess of the renewal process at $t$. We now
apply renewal reward processes to obtain the long run average values
of the age and of the excess, as well as the long run proportions of
time that the age and the excess are less than $x$.


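The distribution function $F_e$ defined by

$$F_e(x) = \frac{1}{\mu}\int_0^x \bar{F}(y)\,dy, \quad x \ge 0,$$

where $\mu$ is the mean of $F$ and $\bar{F}(y) = 1 - F(y)$, is called the equilibrium distribution of the renewal process.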


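Proposition 6.12 Let $X$ have distribution $F$. With probability 1

(a) $\displaystyle \lim_{t\to\infty} \frac{1}{t}\int_0^t A(s)\,ds = \lim_{t\to\infty} \frac{1}{t}\int_0^t E[A(s)]\,ds = \frac{E[X^2]}{2E[X]}$

(b) $\displaystyle \lim_{t\to\infty} \frac{1}{t}\int_0^t Y(s)\,ds = \lim_{t\to\infty} \frac{1}{t}\int_0^t E[Y(s)]\,ds = \frac{E[X^2]}{2E[X]}$

(c) $\displaystyle \lim_{t\to\infty} \frac{1}{t}\int_0^t I_{\{A(s) < x\}}\,ds = \lim_{t\to\infty} \frac{1}{t}\int_0^t P(A(s) < x)\,ds = F_e(x)$

(d) $\displaystyle \lim_{t\to\infty} \frac{1}{t}\int_0^t I_{\{Y(s) < x\}}\,ds = \lim_{t\to\infty} \frac{1}{t}\int_0^t P(Y(s) < x)\,ds = F_e(x)$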


Proof: To prove (a), imagine that a reward at rate $A(s)$ is earned at
time $s$, $s \ge 0$. Then, this reward process is a renewal reward process
with a new cycle beginning each time a renewal occurs. Because the
reward rate at a time $x$ units into a cycle is $x$, it follows that if $X$ is
the length of a cycle, then $T$, the total reward earned during a cycle,
is

$$T = \int_0^X x\,dx = X^2/2.$$

As $\lim_{t\to\infty} \frac{1}{t}\int_0^t A(s)\,ds$ is the long run average reward per unit time,
part (a) follows from the renewal reward theorem.
To prove (c) imagine that we earn a reward at a rate of 1 per unit
time whenever the age of the renewal process is less than $x$. That
is, at time $s$ we earn a reward at rate 1 if $A(s) < x$ and at rate 0 if
$A(s) \ge x$. Then this reward process is also a renewal reward process
in which a new cycle begins whenever a renewal occurs. Because we
earn at rate 1 during the first $x$ units of a renewal cycle and at rate 0
for the remainder of the cycle, it follows that, with $T$ being the total
reward earned during a cycle,

$$\begin{aligned}
E[T] &= E[\min(X, x)] \\
&= \int_0^\infty P(\min(X, x) > t)\,dt \\
&= \int_0^x \bar{F}(t)\,dt.
\end{aligned}$$

Because $\lim_{t\to\infty} \frac{1}{t}\int_0^t I_{\{A(s) < x\}}\,ds$ is the average reward per unit
time, part (c) follows from the renewal reward theorem.


We will leave the proofs of parts (b) and (d) as an exercise. 
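Parts (a) and (c) can be checked numerically from the cycle rewards used in the proof: a cycle of length $X$ contributes $X^2/2$ to the integral of the age and $\min(X, x)$ to the time the age is below $x$. The sketch below is an illustration only; the function name is hypothetical, and the choice of uniform (0,1) interarrival times in the usage note is an assumption made for convenience.

```python
import random

def long_run_age_stats(interarrivals, x):
    """Time-average age and fraction of time with age < x over complete cycles.

    Each cycle of length v contributes v*v/2 to the integral of A(s) and
    min(v, x) to the amount of time during which A(s) < x.
    """
    total_time = sum(interarrivals)
    avg_age = sum(v * v / 2 for v in interarrivals) / total_time
    frac_below = sum(min(v, x) for v in interarrivals) / total_time
    return avg_age, frac_below

rng = random.Random(0)
cycles = [rng.random() for _ in range(200000)]   # uniform (0,1) interarrivals
print(long_run_age_stats(cycles, 0.5))
```

For uniform (0,1) interarrival times, $E[X^2]/(2E[X]) = 1/3$ and $F_e(x) = 2x - x^2$, so the two printed values should be close to $1/3$ and $0.75$.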


6.3.1 Queueing Theory Applications of Renewal Reward 
Processes 


Suppose that customers arrive to a system according to a renewal
process having interarrival distribution $F$, with mean $1/\lambda$. Each
arriving customer is eventually served and departs the system. Suppose
that at each time point the system is in some state, and let $S(t)$
denote the state of the system at time $t$. Suppose that when an arrival
finds the system empty of other customers the evolution of system 
states from this point on is independent of the past and has the 
same distribution each time this event occurs. (We often say that 
the state process probabilistically restarts every time an arrival finds 
the system empty.) Suppose that such an event occurs at time 0. 

If we suppose that each arrival pays an amount to the system, 
with that amount being a function of the state history while that 
customer is in the system, then the resulting reward process is a 
renewal reward process, with a new cycle beginning each time an 
arrival finds the system empty of other customers. Hence, if $R(t)$
denotes the total reward earned by time $t$, and $T$ denotes the length
of a cycle, then

$$\text{average reward per unit time} = \frac{E[R(T)]}{E[T]} \qquad (6.5)$$
Now, let $R_i$ denote the amount of money paid by customer $i$, for $i \ge 1$,
and let $N$ denote the number of customers served in a cycle. Then
the sequence $R_1, R_2, \ldots$ can be thought of as the reward sequence
of a renewal reward process in which $R_i$ is the reward earned during
period $i$ and the cycle time is $N$. Hence, by the renewal reward
process result

$$\lim_{n\to\infty} \frac{R_1 + \cdots + R_n}{n} = \frac{E\left[\sum_{i=1}^N R_i\right]}{E[N]} \qquad (6.6)$$

To relate the left hand sides of (6.5) and (6.6), first note that because
$R(T)$ and $\sum_{i=1}^N R_i$ both represent the total reward earned in a cycle,
they must be equal. Thus,

$$E[R(T)] = E\left[\sum_{i=1}^N R_i\right].$$

Also, with customer 1 being the arrival at time 0 who found the
system empty, if we let $X_i$ denote the time between the arrivals of
customers $i$ and $i+1$, then

$$T = \sum_{i=1}^N X_i.$$

Because the event $\{N = n\}$ is independent of the sequence $X_k, k \ge n+1$,
it follows that $N$ is a stopping time for the sequence $X_i, i \ge 1$.
Thus, Wald's equation gives that

$$E[T] = E[N]E[X] = \frac{E[N]}{\lambda}$$
and we have proven the following.


Proposition 6.13

$$\text{average reward per unit time} = \lambda \bar{R}$$
where
$$\bar{R} = \lim_{n\to\infty} \frac{R_1 + \cdots + R_n}{n}$$

is the average amount that a customer pays.


Corollary 6.14 Let $X(t)$ denote the number of customers in the system
at time $t$, and set

$$L = \lim_{t\to\infty} \frac{1}{t}\int_0^t X(s)\,ds.$$

Also, let $W_i$ denote the amount of time that customer $i$ spends in the
system, and set
$$W = \lim_{n\to\infty} \frac{W_1 + \cdots + W_n}{n}.$$

Then, the preceding limits exist, $L$ and $W$ are both constants, and
$$L = \lambda W.$$

Proof: Imagine that each customer pays 1 per unit time while in the
system. Then the average reward earned by the system per unit time
is $L$, and the average amount a customer pays is $W$. Consequently,
the result follows directly from Proposition 6.13. ∎
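The identity $L = \lambda W$ can be observed on a simulated sample path. The sketch below is a hedged illustration: it assumes a single FIFO server with Poisson arrivals and exponential service times (the corollary requires neither), and all function names are hypothetical. It measures $L$ by integrating the number in system, and $W$ by averaging the time each customer spends in the system.

```python
import random

def simulate_fifo_queue(lam, mu, n, rng):
    """Arrival and departure times of n customers at a single FIFO server.

    Assumption for this sketch: Poisson arrivals (rate lam) and exponential
    service times (rate mu).
    """
    arrivals, departures = [], []
    clock, server_free_at = 0.0, 0.0
    for _ in range(n):
        clock += rng.expovariate(lam)            # next arrival epoch
        start = max(clock, server_free_at)       # wait if the server is busy
        server_free_at = start + rng.expovariate(mu)
        arrivals.append(clock)
        departures.append(server_free_at)
    return arrivals, departures

def little_law_terms(arrivals, departures):
    """Return (L, lam_hat, W) measured over [0, last departure]."""
    events = sorted([(a, 1) for a in arrivals] + [(d, -1) for d in departures])
    area, in_system, prev = 0.0, 0, 0.0
    for time, delta in events:
        area += in_system * (time - prev)        # accumulates integral of X(s)
        in_system += delta
        prev = time
    horizon = prev
    n = len(arrivals)
    W = sum(d - a for a, d in zip(arrivals, departures)) / n
    return area / horizon, n / horizon, W
```

Because every customer has left by the last departure, the measured `L` equals `lam_hat * W` up to rounding on any sample path; under the stated assumptions with `lam = 1` and `mu = 2`, `W` should also be near $1/(\mu - \lambda) = 1$.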


6.4 Blackwell’s Theorem 


A discrete interarrival distribution $F$ is said to be lattice with period
$d$ if $\sum_{n=0}^\infty P(X_i = nd) = 1$ and $d$ is the largest value having this
property. (Not every discrete distribution is lattice. For instance the
two point distribution that puts all its weight on the values 1 and $\pi$,
or any other irrational number, is not lattice.) In the lattice case, renewals
can only occur at integral multiples of $d$. By letting $d$ be the new
unit of time, we can reduce any lattice renewal process to one whose
interarrival times put all their weight on the nonnegative integers,
and are such that the greatest common divisor of $\{n : P(X_i = n) > 0\}$
is 1. (If $\mu'$ was the original mean interarrival time then in terms of
our new units the new mean $\mu$ would equal $\mu'/d$.) So let us suppose
that the interarrival distribution is lattice with period 1, let
$p_j = P(X_i = j), j \ge 0$, and $\mu = \sum_j j p_j$.
Theorem 6.15 Blackwell's Theorem If the interarrival distribution
is lattice with period 1, then

$$\lim_{n\to\infty} P(\text{a renewal occurs at time } n) = \frac{1 - p_0}{\mu}$$

and

$$\lim_{n\to\infty} E[\text{number of renewals at time } n] = \frac{1}{\mu}.$$


Proof. With $A_n$ equal to the age of the renewal process at time $n$,
it is easy to see that $A_n, n \ge 0$, is an irreducible, aperiodic Markov
chain with transition probabilities

$$P_{i,0} = P(X = i \mid X \ge i) = \frac{p_i}{\sum_{j \ge i} p_j} = 1 - P_{i,i+1}, \quad i \ge 0.$$

The limiting probability that this chain is in state 0 is

$$\pi_0 = \frac{1}{E[X \mid X > 0]}$$

where $X$ is an interarrival time of the renewal process. (The mean
number of transitions of the Markov chain between successive visits
to state 0 is $E[X \mid X > 0]$ since an interarrival that is equal to 0 is
ignored by the chain.) Because a renewal occurs at time $n$ whenever
$A_n = 0$, and $E[X \mid X > 0] = \mu/(1 - p_0)$, the first part of the theorem
is proven. The second part of
the theorem also follows because, conditional on a renewal occurring
at time $n$, the total number of renewals that occur at that time is
geometric with mean $(1 - p_0)^{-1}$. ∎
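Blackwell's theorem can be watched converging by computing the renewal probabilities directly. The sketch below is a minimal illustration that assumes $p_0 = 0$ (so the limit is $1/\mu$) and that a renewal occurs at time 0; the function name and the example distribution are hypothetical. It uses the recursion obtained by conditioning on the first interarrival time, $u_n = \sum_{k=1}^n p_k u_{n-k}$ with $u_0 = 1$.

```python
def renewal_probabilities(p, n_max):
    """u[n] = P(a renewal occurs at time n), given a renewal at time 0.

    p maps each positive integer k to P(X = k); this sketch assumes p_0 = 0.
    """
    u = [1.0]
    for n in range(1, n_max + 1):
        u.append(sum(prob * u[n - k] for k, prob in p.items() if 1 <= k <= n))
    return u

p = {1: 0.2, 2: 0.5, 3: 0.3}                 # lattice with period 1
mu = sum(k * prob for k, prob in p.items())  # mean interarrival time, 2.1
u = renewal_probabilities(p, 200)
print(u[1], u[2], u[200], 1 / mu)
```

The early values ($u_1 = 0.2$, $u_2 = 0.54$) still depend on the starting point, while $u_{200}$ agrees with $1/\mu$ to many decimal places.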


6.5 The Poisson Process 


If the interarrival distribution of a renewal process is exponential
with rate $\lambda$, then the renewal process is said to be a Poisson process
with rate $\lambda$. Why it is called a Poisson process is answered by the
next proposition.

Proposition 6.16 If $N(t), t \ge 0$, is a Poisson process having rate $\lambda$,
then $N(t) =_d \text{Poisson}(\lambda t)$.
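Proof. We will show $P(N(t) = k) = e^{-\lambda t}(\lambda t)^k/k!$ by induction on
$k$. Note first that $P(N(t) = 0) = P(X_1 > t) = e^{-\lambda t}$. For $k > 0$, we
condition on $X_1$ to get

$$\begin{aligned}
P(N(t) = k) &= \int_0^\infty P(N(t) = k \mid X_1 = x)\,\lambda e^{-\lambda x}\,dx \\
&= \int_0^t P(N(t - x) = k - 1)\,\lambda e^{-\lambda x}\,dx \\
&= \int_0^t \frac{e^{-\lambda(t-x)}(\lambda(t-x))^{k-1}}{(k-1)!}\,\lambda e^{-\lambda x}\,dx \\
&= \lambda^k e^{-\lambda t}\int_0^t \frac{(t-x)^{k-1}}{(k-1)!}\,dx \\
&= e^{-\lambda t}(\lambda t)^k/k!
\end{aligned}$$
which completes the induction proof. ∎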
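Proposition 6.16 can be illustrated by building the process from its definition, as a renewal process with exponential interarrival times, and comparing the empirical distribution of $N(t)$ with the Poisson probabilities. This is a sketch for illustration; the function names and the parameter values in the usage lines are assumptions, not taken from the text.

```python
import math
import random

def count_renewals(rate, t, rng):
    """N(t) for a renewal process with exponential(rate) interarrival times."""
    n, clock = 0, rng.expovariate(rate)
    while clock <= t:
        n += 1
        clock += rng.expovariate(rate)
    return n

def poisson_pmf(k, mean):
    """P(Poisson(mean) = k)."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

rng = random.Random(42)
counts = [count_renewals(1.5, 2.0, rng) for _ in range(20000)]
for k in range(5):
    print(k, counts.count(k) / len(counts), poisson_pmf(k, 3.0))
```

Each printed row pairs an empirical frequency with $e^{-\lambda t}(\lambda t)^k/k!$ for $\lambda t = 3$; the two columns should agree up to sampling error.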


The Poisson process is often used as a model of customer arrivals 
to a queuing system, because the process has several properties that 
might be expected of such a customer arrival process. The process 
N(t),t > 0, is said to be a counting process if events are occurring 
randomly in time and N(t) denotes the cumulative number of such 
events that occur in $[0,t]$. The counting process $N(t), t \ge 0$, is said
to have stationary increments if the distribution of $N(t+s) - N(t)$
does not depend on $t$, and is said to have independent increments
if $N(t_i + s_i) - N(t_i)$, for $i = 1, 2, \ldots$, are independent random variables
whenever $t_{i+1} > t_i + s_i$ for all $i$. Independent increments says
that customer traffic in one interval of time does not affect traffic in 
another (disjoint) interval of time. Stationary increments says that 
the traffic process is not changing over time. Our next proposition 
shows that the Poisson process is the only possible counting process 
with continuous inter-arrival times and with both properties above. 


Proposition 6.17 The Poisson process is the only counting process
with stationary, independent increments and continuous inter-arrival
times.
Proof. Given a counting process $N(t), t \ge 0$, with continuous inter-arrival
times $X_i$ and stationary, independent increments, then

$$\begin{aligned}
P(X_1 > t + s \mid X_1 > t) &= P(N(t+s) = 0 \mid N(t) = 0) \\
&= P(N(t+s) - N(t) = 0 \mid N(t) = 0) \\
&= P(N(t+s) - N(t) = 0) \\
&= P(N(s) = 0) \\
&= P(X_1 > s),
\end{aligned}$$

where the third equality follows from independent increments, and
the fourth from stationary increments. Thus, we see that $X_1$ is memoryless.
Because the only memoryless continuous distribution is
the exponential distribution, $X_1$ is exponential. Because

$$\begin{aligned}
P(X_2 > s \mid X_1 = t) &= P(N(t+s) - N(t) = 0 \mid X_1 = t) \\
&= P(N(t+s) - N(t) = 0) \\
&= P(N(s) = 0) \\
&= P(X_1 > s),
\end{aligned}$$
it follows that $X_2$ is independent of $X_1$ and has the same distribution.
Continuing in this way shows that all the $X_i$ are iid exponentials, so
$N(t), t \ge 0$, is a Poisson process. ∎
6.6 Exercises


1. Consider a renewal process $N(t)$ with Bernoulli($p$) inter-event
times.

(a) Compute the distribution of $N(t)$.

(b) With $S_i, i \ge 1$ equal to the time of event $i$, find the conditional
probability mass function of $S_1, \ldots, S_k$ given that $N(n) = k$.


2. For a renewal process with inter-event distribution $F$ with density
$F' = f$, prove the renewal equation

$$m(t) = F(t) + \int_0^t m(t - x) f(x)\,dx$$


3. For a renewal process with inter-event distribution $F$, show
that
$$P(X_{N(t)+1} > x) \ge 1 - F(x)$$
The preceding states that the length of the renewal interval that
contains the point $t$ is stochastically larger than an ordinary
renewal interval, and is called the inspection paradox.


4. With $X_1, X_2, \ldots$ independent U(0,1) random variables with
$S_n = \sum_{i \le n} X_i$ and $N = \min\{n : S_n > 1\}$, show that $E[S_N] = e/2$.


5. A room in a factory has $n$ machines which are always all turned
on at the same time, and each works an independent exponential
time with mean $m$ days before breaking down. As soon as
$k$ machines break, a repairman is called. He takes exactly $d$
days to arrive and he instantly repairs all the broken machines
he finds. Then this cycle repeats.

(a) How often in the long run does the repairman get called?

(b) What is the distribution of the total number of broken machines
the repairman finds when he arrives?

(c) What fraction of time in the long run are there more than
$k$ broken machines in the room?


6. Each item produced is either defective or acceptable. Initially
each item is inspected, and this continues until $k$ consecutive
acceptable items are discovered. At this point, one hundred
percent inspection stops and each new item produced is, independently,
inspected with probability $\alpha$. This continues until
a defective item is found, at which point we go back to one
hundred percent inspection, with the inspection rule repeating
itself. If each item produced is, independently, defective with
probability $q$, what proportion of items are inspected?


7. A system consists of two independent parts, with part $i$ functioning
for an exponentially distributed time with rate $\lambda_i$ before
failing, $i = 1, 2$. The system functions as long as at least one of
these two parts is working. When the system stops functioning
a new system, with two working parts, is put into use. A
cost $K$ is incurred whenever this occurs; also, operating costs
at rate $c$ per unit time are incurred whenever the system is operating
with both parts working, and operating costs at rate $c_i$
are incurred whenever the system is operating with only part $i$
working, $i = 1, 2$. Find the long run average cost per unit time.


8. If the interevent times $X_i, i \ge 1$, are independent but with $X_1$
having a different distribution from the others, then $\{N_d(t), t \ge 0\}$
is called a delayed renewal process, where

$$N_d(t) = \sup\Big\{n : \sum_{i=1}^n X_i \le t\Big\}$$

Show that the strong law remains valid for a delayed renewal
process.


9. Prove parts (b) and (d) of Proposition 6.12.

10. Consider a renewal process with continuous inter-event times
$X_1, X_2, \ldots$ having distribution $F$. Let $Y$ be independent of the
$X_i$ and have distribution function $F_e$. Show that

$$E[\min\{n : X_n > Y\}] = \sup\{x : P(X > x) > 0\}/E[X]$$

where $X$ has distribution $F$. How can you interpret this result?

11. Someone rolls a die repeatedly and adds up the numbers. Which
is larger: P(sum ever hits 2) or P(sum ever hits 102)?
12. If $\{N_i(t), t \ge 0\}$, $i = 1, \ldots, k$ are independent Poisson processes
with respective rates $\lambda_i, 1 \le i \le k$, show that $\sum_{i=1}^k N_i(t), t \ge 0$,
is a Poisson process with rate $\lambda = \sum_{i=1}^k \lambda_i$.

13. A system consists of one server and no waiting space. Customers
who arrive when the server is busy are lost. There are
$n$ types of customers: type $i$ customers arrive according to a
Poisson process with rate $\lambda_i$ and have a service time distribution
that is exponential with rate $\mu_i$, with the $n$ Poisson arrival
processes and all the service times being independent.

(a) What fraction of time is the server busy?

(b) Let $X_n$ be the type of customer (or 0 if no customer) in the
system immediately prior to the $n$th arrival. Is this a Markov
chain? Is it time reversible?

14. Let $X_i, i = 1, \ldots, n$ be iid continuous random variables having
density function $f$. Letting

$$X_{(i)} = i\text{th smallest value of } X_1, \ldots, X_n,$$

the random variables $X_{(1)}, \ldots, X_{(n)}$ are called order statistics.
Find their joint density function.

15. Let $S_i$ be the time of event $i$ of a Poisson process with rate $\lambda$.

(a) Show that, conditional on $N(t) = n$, the variables $S_1, \ldots, S_n$
are distributed as the order statistics of a set of $n$ iid uniform
$(0, t)$ random variables.

(b) If passengers arrive at a bus stop according to a Poisson
process with rate $\lambda$, and the bus arrives at time $t$, find the
expected sum of the amounts of time that each boarding customer
has spent waiting at the stop.

16. Suppose that events occur according to a Poisson process with
rate $\lambda$, and that an event occurring at time $s$ is, independent
of what has transpired before time $s$, classified either as a type
1 or as a type 2 event, with respective probabilities $p_1(s)$ and
$p_2(s) = 1 - p_1(s)$. Letting $N_i(t)$ denote the number of type $i$
events by time $t$, show that $N_1(t)$ and $N_2(t)$ are independent
Poisson random variables with means $E[N_i(t)] = \lambda\int_0^t p_i(s)\,ds$.


Chapter 7


Brownian Motion 


7.1 Introduction 


One goal of this chapter is to give one of the most beautiful proofs of 
the central limit theorem, one which does not involve characteristic 
functions. To do this we will give a brief tour of continuous time 
martingales and Brownian motion, and demonstrate how the central 
limit theorem can be essentially deduced from the fact that Brownian 
motion is continuous. 

In section 2 we introduce continuous time martingales, and in 
section 3 we demonstrate how to construct Brownian motion and 
prove it is continuous, and show how the self-similar property of 
Brownian motion leads to an efficient way of estimating the price of 
path-dependent stock options using simulation. In section 4 we show 
how random variables can be embedded in Brownian motion, and in 


section 5 we use this to prove a version of the central limit theorem 
for martingales. 


7.2 Continuous Time Martingales 


Suppose we have sigma fields $\mathcal{F}_t$ indexed by a continuous parameter
$t$ so that $\mathcal{F}_s \subset \mathcal{F}_t$ for all $s \le t$.


Definition 7.1 We say that $X(t)$ is a continuous time martingale for
$\mathcal{F}_t$ if for all $t$ and $0 \le s < t$ we have

1. $E|X(t)| < \infty$

2. $X(t) \in \mathcal{F}_t$

3. $E[X(t) \mid \mathcal{F}_s] = X(s)$


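Example 7.2 Let $N(t)$ be a Poisson process having rate $\lambda$, and let
$\mathcal{F}_t = \sigma(N(s), 0 \le s \le t)$. Then $X(t) = N(t) - \lambda t$ is a continuous
time martingale because

$$\begin{aligned}
E[N(t) - \lambda t \mid \mathcal{F}_s] &= E[N(t) - N(s) - \lambda(t - s) \mid \mathcal{F}_s] + N(s) - \lambda s \\
&= N(s) - \lambda s,
\end{aligned}$$

where the conditional expectation on the right vanishes because the increment
$N(t) - N(s)$ is independent of $\mathcal{F}_s$ and has mean $\lambda(t - s)$.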


We say a process $X(t)$ has stationary increments if $X(t+s) - X(t) =_d X(s) - X(0)$
for all $t, s \ge 0$. We also say a process $X(t)$
has independent increments if $X(t_1) - X(t_0), X(t_2) - X(t_1), \ldots$ are
independent random variables whenever $t_0 \le t_1 \le \cdots$.

Though a Poisson process has stationary independent increments, 
it does not have continuous sample paths. We show below that it is 
possible to construct a process with continuous sample paths, called 
Brownian motion, which in fact is the only possible martingale with 
continuous sample paths and stationary and independent increments. 
These properties will be key in proving the central limit theorem. 


7.3 Constructing Brownian Motion


Brownian motion is a continuous time martingale which produces a
randomly selected path typically looking something like the following.

[Figure: a typical sample path of Brownian motion]

Here we show how to construct Brownian motion $B(t)$ for $0 \le t \le 1$.
To get Brownian motion over a wider interval, you can just repeat
the construction over more unit intervals each time continuing the
path from where it ends in the previous interval.


Given a line segment connecting the point $(a, b)$ to the point $(c, d)$,
with midpoint $(\frac{a+c}{2}, \frac{b+d}{2})$, we say we “move the midpoint up by the
amount $z$” if we break it into
two line segments connecting $(a, b)$ to $(\frac{a+c}{2}, \frac{b+d}{2} + z)$ to $(c, d)$. This
is illustrated below.

[Figure: moving the midpoint of a line segment up by the amount $z$]


Next let $Z_{k,n}$ be iid $N(0, 1)$ random variables, for all $k, n$. We initiate
a sequence of paths. The zeroth path consists of the line segment
connecting the point $(0, 0)$ to $(1, Z_{0,0})$. For $n \ge 1$, path $n$ will consist
of $2^n$ connected line segments, which can be numbered from
left to right. To go from path $n - 1$ to path $n$ simply move the
midpoint of the $k$th line segment of path $n - 1$ up by the amount
$Z_{k,n}/(\sqrt{2})^{n+1}$, $k = 1, \ldots, 2^{n-1}$. Letting $f_n(t)$ be the equation of the
$n$th path, then the random function

$$B(t) = \lim_{n\to\infty} f_n(t)$$
is called standard Brownian motion.


For example if $Z_{0,0} = 2$ then path 0 would be the line segment
connecting $(0, 0)$ to $(1, 2)$. This looks like the following.

[Figure: Path 0]

Then if $Z_{1,1}/(\sqrt{2})^2 = 1$ we would move the midpoint $(\frac{1}{2}, 1)$ up to
$(\frac{1}{2}, 2)$ and thus path 1 would consist of the two line segments connecting
$(0, 0)$ to $(\frac{1}{2}, 2)$ to $(1, 2)$. This then gives us the following
path.

[Figure: Path 1]

If $Z_{1,2}/(\sqrt{2})^3 = -1$ and $Z_{2,2}/(\sqrt{2})^3 = 1$ then the next path is
obtained by replacing these two line segments with the four line segments
connecting $(0, 0)$ to $(\frac{1}{4}, 0)$ to $(\frac{1}{2}, 2)$ to $(\frac{3}{4}, 3)$ to $(1, 2)$. This
gives us the following.

[Figure: Path 2]

Then the next path would have 8 line segments, and so on.
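The midpoint construction translates directly into code. The sketch below is an illustration, with hypothetical function names: a path is stored as its heights at the equally spaced breakpoints, `refine` performs one "move the midpoints up" step, and `brownian_path` applies the displacements $Z_{k,n}/(\sqrt{2})^{n+1}$ for a chosen number of levels (a finite approximation to the limit $B(t)$).

```python
import math
import random

def refine(values, displacements):
    """Go from path n-1 to path n.

    values: heights of path n-1 at its breakpoints (left to right).
    displacements: amount by which the midpoint of each segment is moved up.
    """
    out = []
    for k in range(len(values) - 1):
        midpoint = (values[k] + values[k + 1]) / 2 + displacements[k]
        out.extend([values[k], midpoint])
    out.append(values[-1])
    return out

def brownian_path(levels, rng):
    """Heights of path `levels` at t = 0, 1/2^levels, ..., 1."""
    values = [0.0, rng.gauss(0.0, 1.0)]          # path 0: (0,0) to (1, Z_{0,0})
    for n in range(1, levels + 1):
        scale = math.sqrt(2) ** (n + 1)
        moves = [rng.gauss(0.0, 1.0) / scale for _ in range(2 ** (n - 1))]
        values = refine(values, moves)
    return values

# Reproduce the worked example: path 0 -> path 1 -> path 2.
path1 = refine([0, 2], [1])
path2 = refine(path1, [-1, 1])
print(path1, path2)
```

The two printed lists give the heights at $t = 0, \frac12, 1$ and at $t = 0, \frac14, \frac12, \frac34, 1$, matching paths 1 and 2 of the example.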


Remark 7.3 By this recursive construction it can immediately be
seen that $B(t)$ is “self similar” in the sense that $\{B(t/2^n)(\sqrt{2})^n, 0 \le t \le 1\}$
has the same distribution as $\{B(t), 0 \le t \le 1\}$. This is the
famous “fractal” property of Brownian motion.


Proposition 7.4 Brownian motion B(t) is a martingale with station- 
ary, independent increments and B(t) ~ N(0,t). 


Before we prove this we need a lemma. 


Lemma 7.5 If X and Y are iid mean zero normal random variables 


then the pair Y+X and Y —X are also iid mean zero normal random 
variables. 


Proof. Since $X, Y$ are independent, the pair $X, Y$ has a bivariate
normal distribution. Consequently $X - Y$ and $X + Y$ have a bivariate
normal distribution and thus it’s immediate that $Y + X$ and $Y - X$
are identically distributed normal. Then

$$\mathrm{Cov}(Y + X, Y - X) = E[(Y + X)(Y - X)] = E[Y^2 - X^2] = 0$$

gives the result (since uncorrelated bivariate normal random variables
are independent). ∎


Proof of Proposition 7.4. Letting

$$b(k, n) = B(k/2^n) - B((k - 1)/2^n)$$
we will prove that for any $n$ the random variables
$$b(1, n), b(2, n), \ldots, b(2^n, n)$$

are independent and identically distributed Normal$(0, 1/2^n)$ random
variables. After we prove that Brownian motion is a continuous
function, which we do in the proposition immediately following this
proof, we can then write

$$B(t) - B(s) = \lim_{n\to\infty} \sum_{k:\, s + 1/2^n < k/2^n < t} b(k, n).$$

It will then follow that $B(t) - B(s) \sim N(0, t - s)$ for $t > s$ and that
Brownian motion has stationary, independent increments and is a
martingale.


We will complete the proof by induction on $n$. By the first step
of the construction we get $b(1, 0) \sim$ Normal$(0, 1)$. We then assume
as our induction hypothesis that

$$b(1, n-1), b(2, n-1), \ldots, b(2^{n-1}, n-1)$$

are independent and identically distributed Normal$(0, 1/2^{n-1})$ random
variables. Following the rules of the construction we have

$$b(2k - 1, n) = b(k, n - 1)/2 + Z_{k,n}/(\sqrt{2})^{n+1}$$

and

$$\begin{aligned}
b(2k, n) &= b(k, n - 1) - b(2k - 1, n) \\
&= b(k, n - 1)/2 - Z_{k,n}/(\sqrt{2})^{n+1},
\end{aligned}$$

which is also illustrated in the figure below; in the figure we write
$Z = Z_{k,n}/(\sqrt{2})^{n+1}$.
[Figure: the increment $b(k, n-1)$ over the interval from $(k-1)/2^{n-1}$ to $k/2^{n-1}$, with the midpoint at $(2k-1)/2^n$ moved up by $Z$]

Since $b(k, n - 1)/2$ and $Z_{k,n}/(\sqrt{2})^{n+1}$ are iid Normal$(0, 1/2^{n+1})$
random variables, we then apply the previous lemma to obtain that
$b(2k - 1, n)$ and $b(2k, n)$ are iid Normal$(0, 1/2^n)$ random variables.
Since $b(2k - 1, n)$ and $b(2k, n)$ are independent of $b(j, n - 1), j \ne k$,
we get that

$$b(1, n), b(2, n), \ldots, b(2^n, n)$$
are independent and identically distributed Normal$(0, 1/2^n)$ random
variables.

And even though each function $f_n(t)$ is continuous, it is not immediately
obvious that the limit $B(t)$ is continuous. For example,
if we instead always moved midpoints by a non-random amount $x$,
we would have $\sup_{t \in A} B(t) - \inf_{t \in A} B(t) \ge x$ for any interval $A$, and
thus $B(t)$ would not be a continuous function. We next show that
Brownian motion is a continuous function.


Proposition 7.6 Brownian motion is a continuous function with prob- 
ability 1. 


Proof. Note that

$$P(B(t) \text{ is not continuous}) \le \sum_{i=1}^\infty P(B(t) \text{ has a discontinuity larger than } 1/i),$$
and so the theorem will be proved if we show, for any $\epsilon > 0$,

$$P(B(t) \text{ has a discontinuity larger than } \epsilon) = 0.$$
Since by construction $f_m$ is continuous for any given $m$, in order
for $B(t)$ to have a discontinuity larger than $\epsilon$ we must have

$$\sup_{0 \le t \le 1} |B(t) - f_m(t)| > \epsilon/2$$

or else $B(t)$ would necessarily always be within $\epsilon/2$ of the known
continuous function $f_m$, and it would be impossible for $B(t)$ to have
a discontinuity larger than $\epsilon$. Letting

$$d_n = \sup_{0 \le t \le 1} |f_{n-1}(t) - f_n(t)|$$

be the largest difference between the function at stage $n - 1$ and stage
$n$, we must then have

$$d_n > \epsilon(3/4)^n/8 \text{ for some } n > m,$$

because otherwise it would mean
$$\sup_{0 \le t \le 1} |B(t) - f_m(t)| \le \sum_{n > m} d_n \le \frac{\epsilon}{2}\sum_{n > m} (3/4)^n/4 \le \epsilon/2.$$


Next note by the construction we have

$$\begin{aligned}
P(d_n > x) &= P\Big(\sup_{1 \le k \le 2^{n-1}} |Z_{k,n}/(\sqrt{2})^{n+1}| > x\Big) \\
&\le 2^n P(|Z| > (\sqrt{2})^{n+1} x) \\
&\le \exp(n - (\sqrt{2})^{n+1} x),
\end{aligned}$$
where the last line is for sufficiently large $n$ and we use $2^n < e^n$ and
$P(|Z| > x) \le e^{-x}$ for sufficiently large $x$ (see Example 4.5).
This together means, for sufficiently large $m$,

$$\begin{aligned}
P(B(t) \text{ has a discontinuity larger than } \epsilon) &\le \sum_{n > m} P(d_n > \epsilon(3/4)^n/8) \\
&\le \sum_{n > m} \exp(n - \epsilon(3\sqrt{2}/4)^n/8),
\end{aligned}$$

which, since the final sum is finite, can be made arbitrarily close to
zero as $m$ increases. ∎
Remark 7.7 You may notice that the only property of the standard
normal random variable $Z$ used in the proof is that $P(|Z| > x) \le e^{-x}$
for sufficiently large $x$. This means we could have instead constructed
a process starting with $Z_{k,n}$ having an exponential distribution, and
we would get a different limiting process with continuous paths. We
would not, however, have stationary and independent increments.


As it does for Markov chains, the strong Markov property holds
for Brownian motion. This means that $\{B(T + t) - B(T), 0 \le t\}$
has the same distribution as $\{B(t), 0 \le t\}$ for finite stopping times
$T < \infty$. We leave a proof of this as exercise 8 at the end of the
chapter. An easy result using the continuity of Brownian motion
and the strong Markov property for Brownian motion involves the
supremum of Brownian motion.


Proposition 7.8 $P(\sup_{0 \le s \le t} B(s) > x) = 2P(B(t) > x)$.


Proof. Let $T = \inf\{t \ge 0 : B(t) = x\}$, and note that continuity of
Brownian motion gives the first line below of

$$\begin{aligned}
P(B(t) > x) &= P(B(t) > x, T < t) \\
&= P(T < t)P(B(t) - B(T) > 0 \mid T < t) \\
&= P(T < t)/2 \\
&= P\Big(\sup_{0 \le s \le t} B(s) > x\Big)/2,
\end{aligned}$$

and the strong Markov property gives the third line above. ∎


Example 7.9 Path dependent stock options. It is most common that 
the payoff from exercising a stock option depends on the price of 
a stock at a fixed point in time. If it depends on the price of a 
stock at several points in time, it is usually called a path dependent 
option. Though there are many formulas for estimating the value of 
various different types of stock options, many path dependent options 
are commonly valued using Monte Carlo simulation. Here we give 


an efficient way to do this using simulation and the self similarity 
property of Brownian motion. 
Let

$$Y = f_n(B(1), B(2), \ldots, B(n))$$
be the payoff you get when exercising a path-dependent stock option,
for standard Brownian motion $B(t)$ and some given function $f_n$, and
our goal is to estimate $E[Y]$.

The process $X(t) = \exp\{at + bB(t)\}$ is called geometric Brownian
motion with drift, and it is commonly used in finance as a model for a
stock’s price for the purpose of estimating the value of stock options.
One example of a path dependent option is the lookback option,
with payoff function

$$Y = \max_{t = 1, \ldots, n} (\exp\{at + bB(t)\} - k)^+.$$

Another example, the knockout option, is automatically canceled
if some condition is satisfied. For example, you may have the option
to purchase a share of stock during period $n$ for the price $k$, provided
the price has never gone above $a$ during periods 1 through $n$. This
gives a payoff function

$$Y = (\exp\{an + bB(n)\} - k)^+ \times I\Big\{\max_{t = 1, \ldots, n} \exp\{at + bB(t)\} < a\Big\},$$

which is also path-dependent.

The usual method for simulation is to generate $Y_1, Y_2, \ldots, Y_n$ iid
$\sim Y$, and use the estimator $\bar{Y} = \frac{1}{n}\sum_{i=1}^n Y_i$ having $\mathrm{Var}(\bar{Y}) = \frac{1}{n}\mathrm{Var}(Y)$.
The control variates approach, on the other hand, is to find another
variable $X$ with $E[X] = 0$ and $r = \mathrm{corr}(X, Y) \ne 0$, and use the
estimator $\bar{Y}' = \frac{1}{n}\sum_{i=1}^n (Y_i - mX_i)$, where $m = r\sqrt{\mathrm{Var}(Y)/\mathrm{Var}(X)}$ is
the slope of the regression line for predicting $Y$ from $X$. The quantity
$m$ is typically estimated from a short preliminary simulation. Since

$$\mathrm{Var}(\bar{Y}') = (1 - r^2)\mathrm{Var}(\bar{Y}) < \mathrm{Var}(\bar{Y}),$$
we get a reduction in variance and less error for the same length
simulation run.
Consider the simple example where
$$Y = \max\{B(1), B(2), \ldots, B(100)\},$$
and we want to compute $E[Y]$. For each replication, simulate $Y$, then

1. Compute $X' = \max\{B(10), B(20), \ldots, B(100)\}$

2. Compute
$$\begin{aligned}
X_0 &= \sqrt{10}\,\max\{B(1), B(2), \ldots, B(10)\} \\
X_1 &= \sqrt{10}\,(\max\{B(11), B(12), \ldots, B(20)\} - B(10)) \\
&\;\;\vdots \\
X_9 &= \sqrt{10}\,(\max\{B(91), B(92), \ldots, B(100)\} - B(90))
\end{aligned}$$
notice that self similarity of Brownian motion means that the
$X_i$ are iid $\sim X'$.

3. Use the control variate $X = X' - \frac{1}{10}\sum_{i=0}^9 X_i$.
Your estimate for that replication is $Y - mX$.

Since $E[X] = 0$ and $X$ and $Y$ are expected to be highly positively
correlated, we should get a very low variance estimator.
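The three steps above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the function names are hypothetical, the slope $m$ is estimated as the sample covariance of $(X, Y)$ divided by the sample variance of $X$ from a separate pilot run, and the pilot and main run sizes are arbitrary choices.

```python
import math
import random
import statistics

def one_replication(rng):
    """Simulate B(1), ..., B(100) and return (Y, X) for one replication."""
    b = [0.0]
    for _ in range(100):
        b.append(b[-1] + rng.gauss(0.0, 1.0))    # B(t) - B(t-1) ~ N(0, 1)
    y = max(b[1:])                               # Y = max{B(1), ..., B(100)}
    x_prime = max(b[10::10])                     # X' = max{B(10), ..., B(100)}
    blocks = [math.sqrt(10) * (max(b[10 * i + 1: 10 * i + 11]) - b[10 * i])
              for i in range(10)]                # X_0, ..., X_9
    return y, x_prime - sum(blocks) / 10         # control variate X, E[X] = 0

def regression_slope(pairs):
    """Sample Cov(X, Y) / Var(X): the slope for predicting Y from X."""
    ys = [y for y, _ in pairs]
    xs = [x for _, x in pairs]
    my, mx = statistics.mean(ys), statistics.mean(xs)
    cov = sum((x - mx) * (y - my) for y, x in pairs) / (len(pairs) - 1)
    return cov / statistics.variance(xs)

def control_variate_run(n_pilot, n_main, rng):
    """Return (plain estimates, control-variate estimates) per replication."""
    m = regression_slope([one_replication(rng) for _ in range(n_pilot)])
    main = [one_replication(rng) for _ in range(n_main)]
    return [y for y, _ in main], [y - m * x for y, x in main]
```

Comparing `statistics.variance` of the two returned lists shows how much the control variate reduces the per-replication variance, while their means estimate the same quantity $E[Y]$.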


7.4 Embedding Variables in Brownian Motion


Using the fact that both Brownian motion $B(t)$ and $(B(t))^2 - t$ are
martingales (we ask you to prove that $(B(t))^2 - t$ is a martingale in
the exercises at the end of the chapter) with continuous paths, the
following stopping theorem can be proven.


Proposition 7.10 With $a < 0 < b$ and $T = \inf\{t \ge 0 : B(t) = a \text{ or } B(t) = b\}$,
then $E[B(T)] = 0$ and $E[(B(T))^2] = E[T]$.


Proof. Since $B(n2^{-m})$ for $n = 0, 1, \ldots$ is a martingale (we ask
you to prove this in the exercises at the end of the chapter) and
$E[|B(2^{-m})|] < \infty$, we see that for finite stopping times condition (3)
of Proposition 3.14 (the martingale stopping theorem) holds. If we
use the stopping time $T_m = 2^{-m}\lfloor 2^m T + 1 \rfloor$ this then gives us the
first equality of

$$0 = E[B(\min(t, T_m))] \to E[B(\min(t, T))] \to E[B(T)],$$

where the first arrow is as $m \to \infty$ and follows from the dominated
convergence theorem (using continuity of $B(t)$ and $T_m \to T$ to get
$B(\min(t, T_m)) \to B(\min(t, T))$ and using the bound $|B(\min(t, T_m))| \le \sup_{0 \le s \le t} |B(s)|$;
this bound has finite mean by Proposition 7.8), and
the second arrow is as $t \to \infty$ and follows again from the dominated
convergence theorem (using continuity of $B(t)$ and $\min(t, T) \to T$ to
get $B(\min(t, T)) \to B(T)$ and also using the bound $|B(\min(t, T))| \le b - a$),
and hence the first part of the result. The argument is similar
for the second claim of the proposition, by starting with the discrete
time martingale $(B(n2^{-m}))^2 - n2^{-m}, n = 0, 1, \ldots$. ∎


Proposition 7.11 With the definitions from the previous proposition,
$P(B(T) = a) = b/(b - a)$ and $E[T] = -ab$.
Proof. By the previous proposition
$$0 = E[B(T)] = aP(B(T) = a) + b(1 - P(B(T) = a))$$
and
$$E[T] = E[(B(T))^2] = a^2 P(B(T) = a) + b^2(1 - P(B(T) = a)),$$
which when simplified and combined give the proposition. ∎
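The simplification can be spelled out as follows, writing $q = P(B(T) = a)$ as shorthand (a symbol introduced here only for this worked step). The first equation is linear in $q$, and substituting its solution into the second gives $E[T]$:

$$0 = aq + b(1 - q) \quad\Longrightarrow\quad q = \frac{b}{b - a}, \qquad 1 - q = \frac{-a}{b - a},$$

$$E[T] = a^2 q + b^2(1 - q) = \frac{a^2 b - ab^2}{b - a} = \frac{ab(a - b)}{b - a} = -ab.$$

For instance, with $a = -1$ and $b = 2$ this gives $P(B(T) = -1) = 2/3$ and $E[T] = 2$.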


Proposition 7.12 Given a random variable $X$ having $E[X] = 0$ and
$\mathrm{Var}(X) = \sigma^2$, there exists a stopping time $T$ for Brownian motion
such that $B(T) =_d X$ and $E[T] = \sigma^2$.


You might initially think of using the obvious stopping time
$T = \inf\{t > 0 : B(t) = X\}$, but it turns out this gives $E[T] = \infty$.
Here is a better approach.

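One way to see why the obvious stopping time fails is sketched here. It assumes that $X$ is independent of the Brownian motion and uses the reflection principle, a standard fact that is not derived in this section. For a fixed level $x > 0$ write $T_x = \inf\{t > 0 : B(t) = x\}$ (notation introduced only for this sketch). Then

```latex
P(T_x > t) = P\Big(\max_{0 \le s \le t} B(s) < x\Big)
           = P(|B(t)| < x)
           = P\big(|B(1)| < x/\sqrt{t}\big)
           \sim \frac{2x}{\sqrt{2\pi t}} \quad (t \to \infty),
\qquad\text{so}\qquad
E[T_x] = \int_0^\infty P(T_x > t)\, dt = \infty .
```

By symmetry the same holds for $x < 0$, so conditioning on $X = x$ gives an infinite mean for every $x \ne 0$.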
Proof. We give a proof for the case where X is a continuous random 
variable having density function f, and it can be shown that the 
general case follows using a similar argument. 

Let $Y, Z$ be random variables having joint density function

$$g(y, z) = (z - y) f(z) f(y) / E[X^+], \quad \text{for } y < 0 < z.$$


This function is a density because

$$\begin{aligned}
\int_{-\infty}^{0} \int_{0}^{\infty} g(y, z)\, dz\, dy
&= \int_{-\infty}^{0} \int_{0}^{\infty} (z - y) f(z) f(y) / E[X^+]\, dz\, dy \\
&= \int_{-\infty}^{0} f(y)\, dy \int_{0}^{\infty} z f(z)\, dz / E[X^+]
 - \int_{0}^{\infty} f(z)\, dz \int_{-\infty}^{0} y f(y)\, dy / E[X^+] \\
&= P(X < 0) + P(X > 0) E[X^-] / E[X^+] \\
&= 1,
\end{aligned}$$

where we use $E[X^-] = E[X^+]$ in the last line.
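The normalization can be checked numerically for a concrete, asymmetric density. The sketch below assumes, purely for illustration, that $X = E - 1$ with $E$ exponential with rate 1, so that $f(x) = e^{-(x+1)}$ for $x > -1$, $E[X] = 0$, and $E[X^+] = e^{-1}$. The function names, the grid sizes, and the truncation of the $z$ range at 30 are choices made here, not part of the text.

```python
import math

def f(x):
    """Density of X = E - 1 with E ~ Exp(1) (illustrative choice; mean 0, variance 1)."""
    return math.exp(-(x + 1.0)) if x > -1.0 else 0.0

E_X_PLUS = math.exp(-1.0)   # E[X^+] = integral of x f(x) over x > 0 for this f

def g(y, z):
    """Joint density g(y, z) = (z - y) f(z) f(y) / E[X^+] on y < 0 < z."""
    return (z - y) * f(z) * f(y) / E_X_PLUS

def integrate_g(ny=100, nz=600, z_max=30.0):
    """Midpoint-rule approximation of the integral of g over (-1, 0) x (0, z_max)."""
    hy, hz = 1.0 / ny, z_max / nz
    total = 0.0
    for i in range(ny):
        y = -1.0 + (i + 0.5) * hy
        for j in range(nz):
            z = (j + 0.5) * hz
            total += g(y, z)
    return total * hy * hz

print(integrate_g())   # should be close to 1, up to discretization error
```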

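Then let $T = \inf\{t > 0 : B(t) = Y \text{ or } B(t) = Z\}$. We then obtain
$B(T) =_d X$ by first letting $x < 0$ and using the previous proposition
in the second line below of

$$\begin{aligned}
P(B(T) \le x)
&= \int_{0}^{\infty} \int_{-\infty}^{0} P(B(T) \le x \mid Y = y, Z = z)\, g(y, z)\, dy\, dz \\
&= \int_{0}^{\infty} \int_{-\infty}^{x} \frac{z}{z - y}\, g(y, z)\, dy\, dz \\
&= \int_{0}^{\infty} \frac{z f(z)}{E[X^+]}\, dz \int_{-\infty}^{x} f(y)\, dy \\
&= P(X \le x) \int_{0}^{\infty} \frac{z f(z)}{E[X^+]}\, dz \\
&= P(X \le x),
\end{aligned}$$

and then noticing that a similar argument works for the case where
$x > 0$.
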
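To obtain $E[T] = \sigma^2$ note that the previous proposition gives

$$\begin{aligned}
E[T] &= E[-YZ] \\
&= \int_{0}^{\infty} \int_{-\infty}^{0} -yz\, g(y, z)\, dy\, dz \\
&= \int_{0}^{\infty} \int_{-\infty}^{0} \frac{yz(y - z) f(z) f(y)}{E[X^+]}\, dy\, dz \\
&= \int_{-\infty}^{0} x^2 f(x)\, dx + \int_{0}^{\infty} x^2 f(x)\, dx \\
&= \sigma^2.
\end{aligned}$$
∎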
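The construction in this proof can be tried out on a simple case. The sketch below assumes, for illustration only, that $X$ is uniform on $(-1, 1)$, so that $f = 1/2$, $E[X^+] = 1/4$, $\sigma^2 = 1/3$, and $g(y, z) = z - y$ on $-1 < y < 0 < z < 1$. Rather than simulating a Brownian path, it uses Proposition 7.11 directly: given $Y = y$ and $Z = z$, the path exits at $y$ with probability $z/(z - y)$ and the conditional mean of $T$ is $-yz$. The function names, the seed, and the rejection sampler are choices made here.

```python
import random

def sample_YZ(rng):
    """Rejection-sample (Y, Z) from g(y, z) = z - y on -1 < y < 0 < z < 1,
    the joint density obtained when X is uniform on (-1, 1)."""
    while True:
        y, z = -rng.random(), rng.random()
        if rng.random() * 2.0 < z - y:      # g is bounded by 2 on this square
            return y, z

def embed_once(rng):
    """Return (B(T), E[T | Y, Z]) using the exit law of Proposition 7.11."""
    y, z = sample_YZ(rng)
    exit_at_y = rng.random() < z / (z - y)  # P(B(T) = y | Y = y, Z = z)
    return (y if exit_at_y else z), -y * z

def cdf(draws, x):
    """Empirical distribution function of the simulated B(T) values."""
    return sum(1 for v, _ in draws if v <= x) / len(draws)

rng = random.Random(1)
reps = 100_000
draws = [embed_once(rng) for _ in range(reps)]
mean_T = sum(t for _, t in draws) / reps
print("E[T] estimate:", mean_T, "vs Var(X) = 1/3")
for x in (-0.5, 0.0, 0.5):
    print("P(B(T) <=", x, ") estimate:", cdf(draws, x), "vs", (x + 1) / 2)
```

The empirical distribution of the simulated $B(T)$ should track the uniform distribution function $(x + 1)/2$, and the average of $-YZ$ should be near $1/3$.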


Remark 7.13 It turns out that the first part of the previous result
works with any martingale $M(t)$ having continuous paths. It can be
shown, with $a < 0 < b$ and $T = \inf\{t > 0 : M(t) = a \text{ or } M(t) = b\}$,
that $E[M(T)] = 0$ and thus we can construct another stopping time
$T$ as in the previous proposition to get $M(T) =_d X$. We do not,
however, necessarily get $E[T] = \sigma^2$.

7.5 The Central Limit Theorem


We are now ready to state and prove a generalization of the central
limit theorem. Since a sequence of independent and identically
distributed random variables is stationary and ergodic, the central
limit theorem then follows from the following proposition.


Proposition 7.14 Suppose $X_1, X_2, \ldots$ is a stationary and ergodic
sequence of random variables with $F_n = \sigma(X_1, \ldots, X_n)$ and such that
$E[X_i | F_{i-1}] = 0$ and $E[X_i^2 | F_{i-1}] = 1$. With $S_n = \sum_{i=1}^n X_i$,
we have $S_n/\sqrt{n} \to_d N(0, 1)$ as $n \to \infty$.


Proof. By the previous proposition and the strong Markov property,
there must exist stopping times $T_1, T_2, \ldots$ where the $D_i = T_{i+1} - T_i$
are stationary and ergodic, and where $S_n = B(T_n)$ for Brownian
motion $B(t)$. The ergodic theorem says $T_m/m \to 1$ a.s., so that
given $\epsilon > 0$ we have

$$N_\epsilon = \min\{n : \forall m \ge n,\ m(1 - \epsilon) < T_m < m(1 + \epsilon)\} < \infty,$$

and so

$$\begin{aligned}
P(S_n/\sqrt{n} \le x) &= P(B(T_n)/\sqrt{n} \le x) \\
&\le P(B(T_n)/\sqrt{n} \le x,\ N_\epsilon \le n) + P(N_\epsilon > n) \\
&\le P\Big(\inf_{|\delta| < \epsilon} B(n(1 + \delta))/\sqrt{n} \le x\Big) + P(N_\epsilon > n) \\
&= P\Big(\inf_{|\delta| < \epsilon} B(1 + \delta) \le x\Big) + P(N_\epsilon > n) \\
&\to P(B(1) \le x)
\end{aligned}$$

as $\epsilon \to 0$ and $n \to \infty$ and using the fact that $B(t)$ is continuous in the
last line. Since the same argument can be applied to the sequence
$-X_1, -X_2, \ldots$, we obtain the corresponding lower bound and thus
the conclusion of the proposition. ∎
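Proposition 7.14 covers sequences that are not independent. As an illustration, the sketch below simulates a hypothetical dependent sequence invented here: $X_i$ is a fair $\pm 1$ coin flip when $X_{i-1} > 0$ and a standard normal otherwise. In both cases the conditional mean is 0 and the conditional variance is 1, and because the sign of each $X_i$ is a fair coin flip regardless of the past, starting from a symmetric $X_0$ gives a stationary sequence. The sample sizes, the seed, and the function names are also assumptions of this sketch.

```python
import random
from statistics import NormalDist

def next_x(prev, rng):
    """Draw X_i given X_{i-1}: a +-1 coin flip if prev > 0, else standard normal.
    Either way the conditional mean is 0 and the conditional variance is 1."""
    if prev > 0:
        return 1.0 if rng.random() < 0.5 else -1.0
    return rng.gauss(0.0, 1.0)

def scaled_sum(n, rng):
    """Return S_n / sqrt(n) for one realization of the sequence."""
    prev = rng.gauss(0.0, 1.0)   # X_0; only its sign affects X_1
    total = 0.0
    for _ in range(n):
        prev = next_x(prev, rng)
        total += prev
    return total / n ** 0.5

rng = random.Random(2)
n, reps = 200, 4000
samples = [scaled_sum(n, rng) for _ in range(reps)]
for x in (-1.0, 0.0, 1.0):
    est = sum(1 for s in samples if s <= x) / reps
    print("x =", x, "estimate", est, "vs Phi(x) =", NormalDist().cdf(x))
```

The empirical distribution function of $S_n/\sqrt{n}$ should be close to the standard normal distribution function at each of the three points, up to Monte Carlo error.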

7.6 Exercises


1. Show that if $X(t), t > 0$ is a continuous time martingale then
$X(t_i), i > 0$ is a discrete time martingale whenever
$t_1 < t_2 < \cdots < \infty$ are increasing stopping times.


2. If $B(t)$ is standard Brownian motion, show for any $a > 0$ that
$B(at)/\sqrt{a}, t > 0$ is a continuous time martingale with stationary
independent increments and $B(at)/\sqrt{a} =_d B(t)$.


3. If $B(t)$ is standard Brownian motion, compute $\mathrm{Cov}(B(t), B(s))$.


4. If $B(t)$ is standard Brownian motion, which of the following
is a continuous time martingale with stationary independent
increments? (a) $\sqrt{t} B(1)$ (b) $B(3t) - B(2t)$ (c) $-B(2t)/\sqrt{2}$


5. If $B(t)$ is standard Brownian motion, show that $(B(t))^2 - t$ and
$(B(t))^3 - 3tB(t)$ are continuous time martingales.


6. If $B(t)$ is standard Brownian motion and
$T = \inf\{t > 0 : B(t) < 0\}$, compute $E[T]$.


7. Is $(N(t))^2 - \lambda N(t)$ a martingale when $N(t)$ is a Poisson process
with rate $\lambda$?


8. Prove the strong Markov property for Brownian motion as
follows. (a) First prove for discrete stopping times $T$ using
the same argument as the strong Markov property for Markov
chains. (b) Extend this to arbitrary stopping times $T < \infty$
using the dominated convergence theorem and the sequence of
stopping times $T_n = (\lfloor 2^n T \rfloor + 1)/2^n$. (c) Apply the extension
theorem to show that Brownian motion after a stopping time
is the same as Brownian motion.